
Exploring Key Factors Driving Parents to Immunize Their Children in India Using Regression Analysis

by Mochammad Miftahul Fahmi (5972035)

Figure: Image of Immunization of a baby in India (Source: UNICEF).

The program is structured as follows:

  • Part 01: Download Data and Initial Data Checking

  • Part 02: Research Formulation

  • Part 03: Data Cleaning and Preparation

  • Part 04: Exploratory Data Analysis: Regression

  • Part 05: Exploratory Spatial Data Analysis: Spatial Regression

  • Part 06: Conclusions and Reflection

WHAT WILL WE DO HERE?

This code is a continuation of the previous notebook.

What is our agenda? I will mainly use Pandas for data cleaning and reshaping, Matplotlib for plot visualization, and Seaborn for spatial visualization. Lastly, because this research focuses mainly on inference (finding the variables that most heavily affect the immunization level), statsmodels will be the primary tool for regression analysis; a train-test split will also be performed using sklearn.

What is the Difference from the Previous Research (VISUALIZATION PART)?

Note: the goal of this research is still the same as the previous research: primarily the immunization level (number of immunizations done) of children aged 0-3 years in India.

  1. As a sneak peek, this notebook mostly highlights the story, especially how we narrow down the affecting variables.
  2. The hypotheses in the preliminary research were limited to some 'already chosen' variables based on a theoretical framework and assumptions, and for the sake of simplicity. Here, as we are already equipped with ML tools (regression), I will tell the story by first broadening the set of variables, experimenting with several models based on the performance parameters of the regression results, and finally choosing the appropriate variables. Thus, the story is more complex here. Note that we focus more on inference than on prediction.
  3. H4 (H4: The number of health facilities in an area (represents infrastructure) positively impacts the level of immunized children (age 0-3 years)), in my humble opinion, resulted in bias because I calculated the entire total area of a district. To be more accurate, we should actually calculate the habitable area, but that would be quite complex for this assignment. Therefore, we will now use the data that is available from SHRUG or even OGD (Open Government Data) of India.

Cheers and Happy New Year! 🍻🥳


Part 01: Initial Data Check

In this part, main activities are:

  1. Declare the libraries. I will mainly use Pandas for data cleaning and reshaping, Matplotlib for plot visualization, and Seaborn for spatial visualization. Because this research focuses mainly on inference (finding the variables that most heavily affect the immunization level), statsmodels will be the primary tool for regression analysis; a train-test split will also be performed using sklearn.
  2. Read the CSV file using pandas read_csv. The data is located in the ./data folder.
  3. Check a glimpse of the data using head.
  4. Check the number of rows and columns using "shape".
  5. Check the unique columns (variables) and rows (index) using "unique".
In [126]:
# Declaration of Useful Library

# For dataframe processing (shp, csv)
import pandas as pd
import geopandas as gpd
import numpy as np
import os
from libpysal.weights import Queen
from pysal.lib import weights
from pysal.explore import esda
from splot.esda import moran_scatterplot, lisa_cluster, plot_local_autocorrelation

# For regression and EDA
from scipy.stats import pearsonr
from scipy.stats import mode

# statsmodel
import statsmodels.api as sm
from statsmodels.regression.linear_model import OLS
from statsmodels.stats.outliers_influence import variance_inflation_factor

## sklearn
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

# For mapping geographical features
import osmnx as ox
import seaborn as sns
import contextily as ctx
import mapclassify

# For plotting
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
from matplotlib.lines import Line2D
from shapely.geometry import Point

Read the CSV file using read_csv. The CSV was downloaded from the open-source Mission Antyodaya Village Facilities in SHRUG (https://www.devdatalab.org/shrug_download/). Information can be found in the SHRUG documentation.

The analysis will be focused on "Subdistrict" and "District" level because of some reasons (explained on the Reflection at the bottom of Part 01).

In [2]:
# Read CSV file, at desired level
initial_data = pd.read_csv("./data/shrug-antyodaya-csv/antyodaya_pc11subdist.csv")
# The content of the CSV will be explained on Part 02

# Check the data at a glimpse using head
initial_data.head()
Out[2]:
pc11_state_id pc11_district_id pc11_subdistrict_id total_population male_population female_population total_hhd total_hhd_engaged_in_farm_activi total_hhd_engaged_in_non_farm_ac is_govt_seed_centre_available ... open_uncovered_drainage open_kuchha_drainage fpo no_electricity internal_pucca_road livestock_ext_services _mean_p_miss _core_p_miss _target_weight_share _target_group_max_weight_share
0 3 41 225 778.0 407.0 371.0 155.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.00000 0.00000 0.0 0.0 0.0 1.0 1.0
1 12 245 1563 446.0 216.0 230.0 95.0 40.0 45.0 0.0 ... 0.0 1.0 0.0 0.00000 0.00000 0.0 0.0 0.0 1.0 1.0
2 12 250 1626 603.0 253.0 350.0 129.0 110.0 3.0 0.0 ... 0.0 0.0 0.0 0.45937 0.54063 0.0 0.0 0.0 1.0 1.0
3 12 253 1679 1380.0 828.0 552.0 228.0 120.0 70.0 0.0 ... 0.0 1.0 0.0 0.00000 0.00000 1.0 0.0 0.0 1.0 1.0
4 21 371 2774 297.0 155.0 142.0 70.0 55.0 15.0 0.0 ... 0.0 0.0 0.0 0.00000 1.00000 0.0 0.0 0.0 1.0 1.0

5 rows × 171 columns

In [3]:
# Review the number of records (row) and columns
print("Check the Number of Rows and Columns")
num_rows, num_columns = initial_data.shape
print(f"Number of rows: {num_rows}")
print(f"Number of columns (variables): {num_columns}")
Check the Number of Rows and Columns
Number of rows: 5480
Number of columns (variables): 171

Next, get the name of each column of the initial dataframe. This will be handy for us, the analysts, because we need to know the titles of the required columns.

In [4]:
# Get the name of the column (column title)
column_names = list(initial_data.columns)
print("The name of the columns of the data:" )

for i, column_name in enumerate(column_names, start=1):
    print(f"{i}. {column_name}")
The name of the columns of the data:
1. pc11_state_id
2. pc11_district_id
3. pc11_subdistrict_id
4. total_population
5. male_population
6. female_population
7. total_hhd
8. total_hhd_engaged_in_farm_activi
9. total_hhd_engaged_in_non_farm_ac
10. is_govt_seed_centre_available
11. availability_of_watershed_dev_pr
12. availability_of_rain_harvest_sys
13. availability_of_food_storage_war
14. availability_of_farm_gate_proces
15. availability_of_custom_hiring_ce
16. total_cultivable_area_in_hac
17. net_sown_area_in_hac
18. net_sown_area_kharif_in_hac
19. net_sown_area_rabi_in_hac
20. net_sown_area_other_in_hac
21. is_soil_testing_centre_available
22. is_fertilizer_shop_available
23. no_of_farmers_using_drip_sprinkl
24. area_irrigated_in_hac
25. total_unirrigated_land_area_in_h
26. availability_of_milk_routes
27. availability_of_poultry_dev_proj
28. availability_of_goatary_dev_proj
29. availability_of_pigery_developme
30. is_veterinary_hospital_available
31. availability_of_fish_community_p
32. availability_of_fish_farming
33. availability_of_aquaculture_ext_
34. total_hhd_with_kuccha_wall_kucch
35. total_hhd_have_got_pmay_house
36. total_hhd_in_pmay_permanent_wait
37. total_hhd_got_benefit_under_stat
38. total_hhd_in_perm_waitlist_under
39. piped_water_fully_covered
40. is_village_connected_to_all_weat
41. public_transport
42. availability_of_railway_station
43. total_hhd_having_pmsbhgy_benefit
44. availability_of_elect_supply_to_
45. availability_of_solor_wind_energ
46. total_hhd_having_solor_wind_ener
47. availability_of_panchayat_bhawan
48. csc
49. public_info_board
50. is_common_pastures_available
51. total_hhd_availing_pmuy_benefits
52. availability_of_public_library
53. recreational_centre
54. is_bank_available
55. is_bank_buss_correspondent_with_
56. is_atm_available
57. total_hhd_availing_pmjdy_bank_ac
58. is_post_office_available
59. telephone_services
60. is_broadband_available
61. is_pds_available
62. total_hhd_having_bpl_cards
63. availability_of_primary_school
64. is_primary_school_with_electrici
65. is_primary_school_with_computer_
66. is_primary_school_with_playgroun
67. is_primary_school_have_drinking_
68. availability_of_mid_day_meal_sch
69. total_primary_school_students
70. total_primary_school_teachers
71. availability_of_middle_school
72. availability_of_high_school
73. availability_of_ssc_school
74. no_of_children_not_attending_sch
75. availability_of_govt_degree_coll
76. total_grd_and_pg_in_village
77. is_vocational_edu_centre_availab
78. total_trained_trainee_under_skil
79. availability_of_jan_aushadhi_ken
80. total_hhd_registered_under_pmjay
81. is_community_waste_disposal_syst
82. total_hhd_with_clean_energy
83. is_community_biogas_waste_recycl
84. is_aanganwadi_centre_available
85. is_early_childhood_edu_provided_
86. total_childs_aged_0_to_3_years
87. total_childs_aged_0_to_3_years_r
88. total_childs_aged_3_to_6_years_r
89. total_childs_aged_0_to_3_years_i
90. total_childs_categorized_non_stu
91. total_anemic_pregnant_women
92. total_anemic_adolescent_girls
93. total_underweight_child_age_unde
94. total_male_child_age_bw_0_6
95. total_female_child_age_bw_0_6
96. total_minority_children_getting_
97. total_minority_hh_provided_bank_
98. no_of_physically_challenged_recv
99. total_hhd_with_more_than_2_child
100. availability_of_mother_child_hea
101. total_hhd_availing_pension_under
102. total_shg
103. total_hhd_mobilized_into_shg
104. total_no_of_shg_promoted
105. total_hhd_mobilized_into_pg
106. total_shg_accessed_bank_loans
107. is_bee_farming
108. is_sericulture
109. is_handloom
110. is_handicrafts
111. availability_of_community_forest
112. availability_of_minor_forest_pro
113. total_hhd_source_of_minor_forest
114. availability_of_cottage_small_sc
115. total_hhd_engaged_cottage_small_
116. availability_of_adult_edu_centre
117. total_no_of_registered_children_
118. total_no_of_children_0_to_6_year
119. total_no_of_pregnant_women
120. total_no_of_pregnant_women_recei
121. total_no_of_lactating_mothers
122. total_no_of_lactating_mothers_re
123. total_no_of_women_delivered_babi
124. total_no_of_children_in_icds_cas
125. total_no_of_young_anemic_childre
126. total_no_of_newly_born_children
127. total_no_of_newly_born_underweig
128. total_hhd_not_having_sanitary_la
129. total_no_of_eligible_beneficiari
130. total_no_of_beneficiaries_receiv
131. gp_total_no_of_eligible_benefici
132. gp_total_no_of_beneficiaries_rec
133. gp_total_hhd_eligible_under_nfsa
134. gp_total_hhd_receiving_food_grai
135. total_no_of_farmers_registered_u
136. total_no_of_farmers_subscribed_a
137. total_no_of_farmers
138. total_no_of_farmers_received_ben
139. total_no_farmers_adopted_organic
140. total_no_of_farmers_add_fert_in_
141. total_no_of_elected_representati
142. total_no_of_elect_rep_oriented_u
143. total_no_of_elect_rep_undergone_
144. total_approved_labour_budget_for
145. total_expenditure_approved_under
146. total_area_covered_under_irrigat
147. total_hhd_having_piped_water_con
148. pc11_land_area
149. canal
150. surface_water
151. ground_water
152. other_irrigation
153. any_primary_sch_toilet
154. mandi
155. regular_market
156. weekly_haat
157. phc
158. chc
159. sub_centre
160. closed_drainage
161. open_covered_drainage
162. open_uncovered_drainage
163. open_kuchha_drainage
164. fpo
165. no_electricity
166. internal_pucca_road
167. livestock_ext_services
168. _mean_p_miss
169. _core_p_miss
170. _target_weight_share
171. _target_group_max_weight_share

Reflection of Part 01

Some important reflection based on the decision of which data will be the main set:

A. The main dataset chosen

The main dataset will be SHRUG at the subdistrict level, for the following reasons: 1) To prevent bias: since the author (me) is not a local of the country (India), imagining space at the subdistrict and district level is easier because it is close to the definition of a big town (for example, Anjar subdistrict in Gujarat, which has an area of 1404.89 km2). This helps us understand amenities, as it is easier to imagine health facilities in a town than in a state. 2) Regarding the data analysis process:

  • if our main data is at the district level, it will be difficult to zoom in to the more granular subdistrict level;
  • if our main data is at the subdistrict level, it will be easy to zoom out to a bird's-eye view at the district level (with the help of a pivot).
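Zooming out from the subdistrict to the district level can be sketched with a pandas groupby (a minimal illustration with toy values; the ID and count columns follow the SHRUG naming used in this notebook):

```python
import pandas as pd

# Toy subdistrict-level data (hypothetical values, SHRUG-style ID columns)
sub = pd.DataFrame({
    "pc11_district_id": [41, 41, 245],
    "pc11_subdistrict_id": [225, 226, 1563],
    "total_childs_aged_0_to_3_years_i": [120, 80, 50],
})

# Aggregate subdistrict counts up to the district level
district = (
    sub.groupby("pc11_district_id", as_index=False)
       ["total_childs_aged_0_to_3_years_i"].sum()
)
print(district)
```

The reverse direction (recovering subdistricts from district totals) is not possible, which is exactly the argument for keeping the subdistrict CSV as the main data.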

Hence, I will use the "Subdistrict"-level CSV as the initial main data. Furthermore, for the shapefile, I will use the "District"-level SHP, because the "Subdistrict" shapefile results in 'many islands' during the creation of the spatial weight matrix (this is part of the non-linear process).

The shrid-level CSV was not chosen because of data availability. As you can see in the SHRUG documentation, the shrid level has 11% missing data, while the subdistrict and district levels have only 7% and 4% respectively. Subdistrict-level data may also be more readily available and accurate compared to finer levels such as villages and towns (shrid).

On the other hand, the state level is too broad. State-level data is frequently used for policy advocacy and reporting to higher authorities, but it lacks the nuance required for an amenities case study.

However, the district level will still be analyzed, so we can compare cities across districts. Also, some subdistrict geometries result in many unintended islands, so we will still need the district-level gpkg.

B. Inclusivity and Representativeness 1) Number of missing data: as explained above, it is wiser to have less missing data in a dataset (assuming the number of represented districts is equal). The share of missing data at the shrid level is above 10%, higher than in the other available datasets. The missing data can also be examined in the SHRUG metadata (link: https://docs.devdatalab.org/SHRUG-Metadata/Mission%20Antyodaya%20Village%20Facilities%20%282020%29/Tables/antyodaya-metadata/)

2) The chosen main dataset: the subdistrict level contains no missing total_population data across 5444 subdistricts and 613 districts. Based on reference (https://en.wikipedia.org/wiki/Administrative_divisions_of_India), the data covers 5444 out of 6057 subdistricts (89%, 2018 data) and 613 out of 808 districts (76%, 2022 data), which already covers more than 75% of the area of India.

3) When splitting train-test, we use a STRATIFIED split so that every state in India is represented.
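The stratified split can be sketched with sklearn's train_test_split, passing the state ID as the stratification key (a minimal sketch with toy data; the column names follow the SHRUG convention, the values are hypothetical):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 10 subdistricts across two states (hypothetical values)
df = pd.DataFrame({
    "pc11_state_id": [3] * 5 + [12] * 5,
    "total_childs_aged_0_to_3_years_i": range(10),
})

# stratify= keeps the share of each state equal in train and test
train, test = train_test_split(
    df, test_size=0.2, stratify=df["pc11_state_id"], random_state=42
)
print(test["pc11_state_id"].value_counts())  # both states still represented
```

Without stratify=, a small test set could by chance contain no subdistricts from a given state, which would undermine the representativeness goal above.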

C. Other Reflections using CDS Framework

  • Inequality consideration: since some subdistricts are not available, I wish the database were more complete. However, this database is basically one of the most complete (explained in A above) compared to the other databases in SHRUG. To reduce inequality and improve head-to-head comparison, I avoid combining data from different modules collected at different times. Hence, I will focus solely on the Antyodaya module.
  • Participation and power consideration: the data was provided by researchers from India and, luckily for us (very kind of them!), it is open source. Hence, we can use the data, without forgetting to mention the source.
  • Positionality: in this research, the author (Fahmi) takes a researcher position, always trying to be objective about the outcome of the results and data. The truth about which factors amplify the immunization level will hopefully represent the voice of the children and parents of Indian citizens.

Part 02: Research Formulation

In this part, main activities are:

  1. Explanation of research background
  2. Hypothesis
  3. Variables and Construct
  4. Some IMPORTANT reflection, limitation, and notes!

Research Background

The COVID-19 pandemic reminds us of the importance of vaccines and immunization. Through immunization, our body develops an immune foundation, preventing severe illness and contributing to herd immunity. Hence, the earlier that foundation is built, the more we contribute to herd immunity. Therefore, immunization for children is important: it safeguards against various potentially life-threatening diseases. Protecting this vulnerable age group is vital for long-term public health and the well-being of the community.

This research is basically preliminary research to find further factors driving the immunization level among children aged 0-3 years. As an initial start, there are several factors influencing the immunization level, mainly economic, social, and political. Niranjan (2018)[1] underpins that women from SHGs (Self Help Groups; community organizations that help families and women financially) with a health intervention were more likely to follow overall healthier practices, including providing age-appropriate immunization.

Kumar (2009)[2] found that the educational level of women and their husbands, income level, and infrastructure impact women's participation in SHGs. In this research, we will assume that these factors also impact the level of immunization.

Research Objective

This program is part of research to further understand the driving factors impacting the level of immunization of children aged 0-3 years. In the end, this research will narrow down which variables are likely to influence the level of immunization (and which variables have a heavier impact on the output), allowing future research to focus on the important variables and thus create proper decision making.

Thus, in this research, we focus on INFERENCE instead of prediction.

Scope of the Data

The SHRUG data of Mission Antyodaya Village Facilities contains a large dataset at the shrid, subdistrict, and district level of important factors related to the facilities and socio-economic features of India, collected 2018-2020.

Some important sources:

[1] Saggurti, N., Atmavilas, Y., Porwal, A., Schooley, J., Das, R., Kande, N., Irani, L., & Hay, K. (2018). Effect of health intervention integration within women's self-help groups on collectivization and healthy practices around reproductive, maternal, neonatal and child health in rural India. PloS one, 13(8), e0202562. https://doi.org/10.1371/journal.pone.0202562

[2] Kumar (2009). Participation in Self-Help Group Activities and its Impacts: Evidence from South India. MPRA Paper No. 19943, posted 14 Jan 2010. Online at https://mpra.ub.uni-muenchen.de/19943/

Hypothesis

These hypotheses are part of a non-linear process: at the start of this research, the number of variables was much higher. The most relevant and highest-impact variables are below:


H1: The number of graduates and post graduates (represents the education level) positively impacts the level of immunized children (age 0-3 years).

H2: The number of Self Help Groups (SHGs) (represents community help) positively impacts the level of immunized children (age 0-3 years).

H3: The number of Integrated Child Development Services (ICDS) beneficiaries (represents subsidy/economy) positively impacts the level of immunized children (age 0-3 years).

H4: The number of health facilities in an area (represents infrastructure) positively impacts the level of immunized children (age 0-3 years).

H5: The number of Pradhan Mantri Matru Vandana Yojana (PMMVY) beneficiaries (represents subsidy/economy) positively impacts the level of immunized children (age 0-3 years).

H6: The level of immunized children (age 0-3 years) is affected by neighbouring areas that also have a certain level of immunization.

Hypothesis Explanation

  1. H1: The number of graduates and post graduates (represents the education level of parents) positively impacts the level of immunized children (age 0-3 years). The education level improves cognitive ability and science-consciousness about the fact that immunization strengthens the body's immune system (acknowledging that this is science-based evidence). It is expected that, without prior knowledge of the benefits of immunization, people will not be willing to provide it to their children.

  2. H2: The number of Self Help Groups (SHGs) (represents community help) positively impacts the level of immunized children (age 0-3 years). Self Help Groups contribute to community well-being, potentially boosting child immunization rates (age 0-3) through shared resources, awareness, and support. It is expected that if a group of people immunize their children, the willingness of others to provide immunization will also increase.

  3. H3: The number of Integrated Child Development Services (ICDS) beneficiaries (represents subsidy) positively impacts the level of immunized children (age 0-3 years). This hypothesis builds on the idea that subsidies offered through ICDS contribute to improved accessibility, affordability, and awareness of healthcare services, including immunization.

  4. H4: The number of health facilities in an area (represents infrastructure) positively impacts the level of immunized children (age 0-3 years). This hypothesis arises because I think a well-developed healthcare infrastructure facilitates increased immunization coverage. A higher number of health facilities per area is likely to provide better access to vaccination services, reducing barriers and encouraging parents to adhere to immunization schedules.

  5. H5: The number of Pradhan Mantri Matru Vandana Yojana (PMMVY) beneficiaries (represents subsidy) positively impacts the level of immunized children (age 0-3 years). Pradhan Mantri Matru Vandana Yojana is a maternity benefit program run by the government of India, with much the same rationale as H3 above (immunization is part of the planned scheme for this program, as explained in https://wcd.nic.in/sites/default/files/PMMVY%20Scheme%20Implemetation%20Guidelines%20._0.pdf). With a push from a government programme, parents will be aware of it and will follow the recommendations after guidance.

  6. H6: The level of immunized children (age 0-3 years) is affected by neighbouring areas that also have a certain level of immunization. The immunization practices of one area (subdistrict) may impact the immunization behavior of nearby areas. This comes from the idea that communities with higher immunization rates create a positive norm that influences neighboring communities.

Hypothesis Testing Method

  1. Narrow down the most affecting variables: multivariate regression; see the coefficients along with evaluating the models.
  2. Via correlation, scatter plots, and regression: H1, H2, H3, H5.
  3. Via regression (multivariate, since it is a categorical variable): H4.
  4. Via ESDA + evaluating the regression of the Moran plot: H6.

Summary of Methods

This research proceeds step by step using the statistical tools below:

  1. Pearson correlation (check corr and p-values): making sure that the chosen variables are good predictors.
  2. Multivariate regression (check the weight of contribution): narrow down from broad variables to the chosen variables (hence, this is the story). Process: OLS via statsmodels, check R^2, check multicollinearity (using VIF / Variance Inflation Factors), compare RMSE between models (before and after VIF, per-capita vs non-per-capita variables), then finally check the coefficients (see which has the most influence).
  3. Linear regression (check how each variable moves with the dependent variable): train-test split using sklearn, plot and visual check, calculate R^2 and RMSE, compare train vs test. Is it linear enough?
  4. ESDA: spatial weights based on Queen contiguity and quantiles, Moran plot + regression evaluation, calculate R^2 and p-value.

To improve inclusivity, all the data across India (after removal of unusable data) is used, which represents >80% of Indian subdistricts.

Constructs or Variables

Variables that are investigated here, and their explanation:

| Variables | Represented by | Meaning | Unit | Type |
|----------|----------|----------|----------|----------|
| The number of graduates and post graduates | total_grd_and_pg_in_village, summed up to the subdistrict and district level | Attainment of high-level education (represents awareness of parenting and health) | number, int | independent var |
| The number of SHGs | total_shg | SHG: Self Help Group, a bottom-up community organization (NGO) that aims to empower women financially and in leadership (parenting) | number, int | independent var |
| The number of ICDS beneficiaries (lactating women) | total_no_of_lactating_mothers_re | ICDS: Integrated Child Development Services | number, int | independent var |
| The number of PMMVY beneficiaries | total_no_of_eligible_beneficiari | Pradhan Mantri Matru Vandana Yojana (PMMVY), a maternity benefit programme | number, int | independent var |
| The availability of a mother and child health facility | availability_of_mother_child_hea | Infrastructure available that may be related to immunization of children | number, int | independent var |
| The number of immunized children | total_childs_aged_0_to_3_years_i | Number of children (age 0-3 yrs) with immunization | number, int | dependent var |
| The number of immunized children (per capita of the number of children 0-3 years) | total_childs_aged_0_to_3_years_i divided by total_childs_aged_0_to_3_years | Share of children (age 0-3 yrs) with immunization; per capita is used to compare between districts (explanation in Parts 04 and 05) | % | dependent var, same as above but in a different unit |

In summary, factors that might affect immunization of children under 3 years of age:

  • ECONOMY or GOVT SUBSIDY: mothers receiving the ICDS programme, the PMMVY govt programme
  • SOCIAL: Self Help Groups, the bottom-up community
  • EDUCATION: 'total_grd_and_pg_in_village' (high education attainment)
  • INFRASTRUCTURE: 'availability_of_mother_child_hea' (availability of a mother and child health facility)
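The per-capita dependent variable from the table above can be derived in one line (a sketch with toy values, guarding against subdistricts that report zero children; the derived column name is my own choice):

```python
import numpy as np
import pandas as pd

# Toy subdistrict rows (hypothetical values, SHRUG-style column names)
df = pd.DataFrame({
    "total_childs_aged_0_to_3_years": [100, 50, 0],
    "total_childs_aged_0_to_3_years_i": [80, 25, 0],
})

# Immunized children per child aged 0-3; 0/0 becomes NaN instead of an error
df["immunized_per_capita"] = (
    df["total_childs_aged_0_to_3_years_i"]
    .div(df["total_childs_aged_0_to_3_years"].replace(0, np.nan))
)
print(df["immunized_per_capita"].tolist())
```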

Reasons of choosing the variables

Most of the variable selection in this preliminary research is based on these factors:

  1. Background theory (explained above: which backbone theory leads to each hypothesis).
  2. Availability of variables that could represent the factors in no. 1 above.
  3. Story: after the regression analysis, narrowing the variables down to the most appropriate predictors (some hypotheses end up statistically insignificant yet are still stated in this Part 02, although theoretically and philosophically they should be significant). The step is simple: choose as many variables as may affect the dependent variable, then do the analysis and significance tests.

Explanation of each variable (nos. 1-4 and 6 are independent; no. 5 is dependent)

  1. The education level is represented by the number of graduates. People with a higher education level are expected to have a base of scientific and rational thinking. It is not a direct proxy, but it is one of the most promising variables available in the Mission Antyodaya Village Facilities data (among others related to education).
  2. The number of SHGs (Self Help Groups) represents the help of the community in developing mostly women and children. The community is expected to boost awareness of community-level immunization of children.
  3. Subsidy (a push from the government) is represented by the number of beneficiaries (focused on lactating women) of ICDS (Integrated Child Development Services). There is also the PMMVY programme to help people's economy during the maternal stage. They are chosen because the goal of these policies and subsidies is to push families towards healthier children in India.
  • The ICDS project targets children under 6 years of age and their mothers, so it is related to the targeted unit of analysis in this preliminary research. During the 2018–19 fiscal year, the Indian federal government allocated ₹16,335 crore (US$2.0 billion) to the programme, 60% of its funding, while the states allocated the remaining 40% (https://en.wikipedia.org/wiki/Integrated_Child_Development_Services).
  • The PMMVY project has an important goal: the cash incentive provided should lead to improved health-seeking behaviour amongst Pregnant Women and Lactating Mothers (PW&LM) (https://wcd.nic.in/sites/default/files/PMMVY%20Scheme%20Implemetation%20Guidelines%20._0.pdf). Immunization is also included in the programme.
  4. The number of health facilities is chosen because more facilities will also raise the awareness of the people around them, assuming there are activities related to them. Immunization cannot be done by individuals; it should be done under the supervision of the government and health-related facilities.
  5. Extending the background explanation, the immunization level is chosen because it cannot occur by itself inside a community (meaning some factors likely drive it).
  6. Variable added: global and local ESDA will be used to investigate the correlation between an area with a certain immunization level and its neighbours. It is based on the human tendency of communities to follow the communal behaviour around them ("social conformity" or "herd behavior", https://en.wikipedia.org/wiki/Herd_behavior).
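The idea behind the ESDA step (Part 05 uses esda/splot for the real analysis) can be sketched in plain NumPy with Moran's I on a toy row of four areas, each neighbouring the adjacent ones; a positive I means neighbours tend to resemble each other:

```python
import numpy as np

# Toy immunization values for four areas lying in a row (hypothetical)
y = np.array([1.0, 2.0, 8.0, 9.0])

# Binary contiguity weights: each area neighbours the adjacent area(s)
W = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

# Moran's I = (n / S0) * (z' W z) / (z' z), with z the demeaned values
z = y - y.mean()
n, s0 = len(y), W.sum()
moran_i = (n / s0) * (z @ W @ z) / (z @ z)
print(moran_i)  # 0.4: low values sit next to low, high next to high
```

In the notebook itself, W comes from Queen contiguity on the district geometries rather than a hand-written matrix.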

Inclusion benefit: in this research, we analyze most subdistricts across all of India. The price: computer memory and slow running time, yet the inference should be more accurate than taking a single case study.

Reflection

A. Possible Other Variables that will improve the OUTPUT of RESEARCH.

Factor: Education Level. Alternative variables:

  • Number of people with training in childcare and parenting: I would really like to add this variable because it directly represents the education that affects how parents provide immunization to their children. Children cannot come to clinics by themselves, so parents hold an important role here.

Factor: Economy. Alternative variables:

  • Income level (not available in this dataset): it would have a more direct impact on the dependent variable than the subsidy level (because a higher income could mean more freedom to provide children with immunization).
  • Poverty level (so the analysis could be more inclusive where an area is clearly below the poverty line).

Factor: Health Facilities. Alternative variables:

  • Number of clinics providing immunization. The current assumption is that all health facilities analyzed here can provide immunization.

Time dimension:

  • There is no time-related dimension, so we cannot see how the immunization level grows over time. If it were available, the regression would be deeper and predictions could be tested properly (not only by a simple train-test split).

A proper discussion about per-capita variables: some variables should be scaled down to per capita (so they are fair to compare). However, with this SHRUG data, doing so makes the regression and analysis more biased (the correlation decreases). If possible: communication with the Indian government, or direct per-capita measurement data that I can reach.

B. Possible Rejecting Condition

The hypotheses explained above would be rejected under various conditions:

H1 --> Rejected if the education level is not aligned (low correlation) with the immunization level of children aged 0-3 years in India.

H2 --> Rejected if the presence of SHGs does not demonstrate a statistically significant correlation with the immunization level of children age 0-3 years in India.

H3 --> Rejected if communities with a higher number of ICDS beneficiaries do not exhibit a statistically significant increase in the immunization level of children age 0-3 years in India.

H4 --> Rejected if the presence of Health Facilities does not demonstrate a statistically significant correlation with the immunization level of children age 0-3 years in India.

H5 --> Rejected if communities with a higher number of PMMVY beneficiaries do not exhibit a statistically significant increase in the immunization level of children age 0-3 years in India.

H6 --> Rejected if there is little spatial correlation between an area and its neighbours (i.e., many areas with a high immunization level surrounding areas with a low immunization level).

Note: besides the regression coefficients, the statistical significance will be evaluated. The evidence will come not only from the Pearson correlation, but also from the linear regression.

C. Possible Bias

Possible biases are explained below:

C.1 Possible cases in which the hypothesis will be falsely rejected

H2 --> Rejected, but it turns out it should not have been, because SHGs are heterogeneous (their dynamics regarding immunization differ across communities) and the analysis does not account for this variability.

H4 --> Rejected, but falsely so, because a certain area may be sparsely settled (more empty land) and thus have very few facilities!

C.2 Possible cases in which the hypothesis will be falsely accepted

H1 --> Accepted, but falsely so, because we cannot be sure whether the degrees are science- or health-related. Degree attainment is also an imperfect proxy (having a degree does not necessarily mean the person has the relevant knowledge).

H3 --> Accepted, but falsely so, because of other government subsidies that are directly related to immunization and combined with ICDS, for example if only ICDS beneficiaries could receive the immunization subsidy.

D. Final Reflection and LIMITATIONS OF THE RESEARCH

Important perspectives that I am not taking into account:

  • I am not taking into account government policies other than the ICDS subsidy (a certain area could be strongly pushed by the government towards a 100% immunization level among children, for instance because of good policy and a sufficient government budget).
  • I am not taking into account the time dimension, to see how the variables change over time.
  • Again, the time dimension!! A high level of education might not have a direct impact (within 5 years) on the immunization level. The observations could also be biased because the current immunization level may result from previous years' efforts that are not contained in the dataset!
  • After going through the research here, you will find that the variables can be approximated by linear regression, while at the same time the error (RMSE) does not differ much if we switch to a higher-degree polynomial regression. This needs further discussion with experts, such as Mr Trivik or the TAs, about what the best decision would be. Thus, there is room for exploration and evaluation in this research (this is not its final form!).
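The linear-vs-polynomial comparison mentioned in the last point can be sketched as follows, on synthetic data with a truly linear relationship (under that assumption, the test RMSE barely changes with the polynomial degree):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(300, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=2.0, size=300)  # linear ground truth + noise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit polynomial regressions of increasing degree and compare the test RMSE
for degree in (1, 2, 3):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    rmse = mean_squared_error(y_te, model.predict(X_te)) ** 0.5
    print(degree, round(rmse, 3))
```

When the degree-1 and degree-3 RMSEs are this close, the simpler (linear) model is usually the defensible choice for inference.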

Some important Critical Data Science framework:

  • Inclusion and Inequality: the research here uses almost all data across various areas (around five thousand subdistricts) in India, guaranteeing the representativeness of each region. The train-test split also uses stratified sampling, which guarantees representativeness within each STATE.
  • Participation: NGOs, government efforts (subsidies), and mothers and children (the unit of analysis) participate as the main sources of data.
  • Power: the preliminary research includes choropleths to show the distribution of the immunization level.
  • Positionality: since I am doing preliminary research here, the data will be kept as objective as possible (using only one dataset, from SHRUG, is beneficial for this part to prevent bias).
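The stratified train-test split mentioned above can be sketched with sklearn's `train_test_split` (toy data; the state column name follows the SHRUG convention used in this notebook):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame: two states with 10 subdistricts each
df = pd.DataFrame({
    "pc11_state_id": [1] * 10 + [2] * 10,
    "y": range(20),
})

# stratify preserves the state proportions in both the train and test sets
train, test = train_test_split(
    df, test_size=0.2, random_state=0, stratify=df["pc11_state_id"]
)
print(train["pc11_state_id"].value_counts().to_dict())  # 8 rows from each state
```

Without `stratify`, a random split could over-represent some states in the test set, which would hurt the representativeness argument.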

E. What I made differently this time (compared to previous assignment):

  • Go deeper into regression analysis, even for ESDA (previously the focus was only on visualization; here we will be critical about the performance of the model).
  • MORE storytelling about where the constructs and variables statistically come from.
  • Use more of the CDS framework! It is a very useful checklist of possible biases and makes me realize the imperfections of this research (thus I believe the level here is more that of preliminary research than a full study).
  • Some original data from SHRUG contains modifications of categorical variables using the 2011 census (which I think causes further bias). Thus, we will import the original data (from the same year as in SHRUG) from OGD (Open Government Data) of India. This will be explained further below in Part 03.
  • Motivate your final choice of variables. Did you change your set of variables compared to assignment 2? If yes, does that choice have an effect on your results and their validity? Answer: a variable was added (H5 above) because we will find out that it has quite a big significance for predicting the dependent variable. Also, we will focus on the story: choosing many variables, starting with those most likely to affect child immunization levels, then using regression tools to eliminate them one by one!

Part 03: Data Cleaning ¶

In this part, main activities are:

  0. Initial Variables (before elimination throughout the entire story)
  1. Data Subsetting
  2. Manage Missing Data
  3. Initial Visualization
  4. Data Augmentation

0. Initial Variables (before elimination throughout entire story)

As I explained in Parts 01 and 02 above, I will tell a broader story about the variables that potentially affect the immunization level of children, which will later be narrowed down to the variables that statistically have an impact on the dependent variable. This story is based on the knowledge of the author (Fahmi) after taking the Regression and ML courses by Mr Trivik.

Basic Variables:

Related to identity: 'pc11_state_id','pc11_district_id','pc11_subdistrict_id',

Primary basic variables:

  • 'total_population','male_population','female_population','total_no_of_registered_children_',
  • 'total_hhd' (hhd: household),

Potentially Useful Predictors

  • GOVT SUBSIDY: mothers receiving the ICDS programme: 'total_no_of_lactating_mothers' (mothers in the lactating phase), 'total_no_of_lactating_mothers_re' (those receiving the ICDS subsidy),
  • GOVT SUBSIDY: PMMVY Govt Programme: 'total_no_of_beneficiaries_receiv','total_no_of_eligible_beneficiari',
  • SOCIAL: Self help group 'total_shg',
  • EDUCATION: 'total_grd_and_pg_in_village' (high education attainment: graduates and postgraduates in the village), 'total_trained_trainee_under_skil' (people under skill training),
  • INFRASTRUCTURE: 'availability_of_mother_child_hea' (availability of Mother and Child Health Facility), 'availability_of_adult_edu_centre',
  • GOVT REPRESENTATIVES: 'total_no_of_elect_rep_oriented_u'

The dependent variable itself: 'total_childs_aged_0_to_3_years', 'total_childs_aged_0_to_3_years_i' (the immunized count).

1. Data Subsetting

In [5]:
selected_columns = ['pc11_state_id','pc11_district_id','pc11_subdistrict_id',
                    'total_population','male_population','female_population','total_no_of_registered_children_','total_hhd',
                    'total_childs_aged_0_to_3_years','total_childs_aged_0_to_3_years_i',
                    'total_no_of_lactating_mothers','total_no_of_lactating_mothers_re',
                    'total_no_of_beneficiaries_receiv','total_no_of_eligible_beneficiari',
                    'total_shg',
                    'total_grd_and_pg_in_village',
                    'total_trained_trainee_under_skil',
                    'availability_of_mother_child_hea',
                    'availability_of_adult_edu_centre',
                    'total_no_of_elect_rep_oriented_u','public_info_board'
                    ]

# Select only the columns in the 'selected_columns' array
subset_df = initial_data[selected_columns]
In [6]:
subset_df.describe()
Out[6]:
pc11_state_id pc11_district_id pc11_subdistrict_id total_population male_population female_population total_no_of_registered_children_ total_hhd total_childs_aged_0_to_3_years total_childs_aged_0_to_3_years_i ... total_no_of_lactating_mothers_re total_no_of_beneficiaries_receiv total_no_of_eligible_beneficiari total_shg total_grd_and_pg_in_village total_trained_trainee_under_skil availability_of_mother_child_hea availability_of_adult_edu_centre total_no_of_elect_rep_oriented_u public_info_board
count 5480.000000 5480.000000 5480.000000 5.444000e+03 5.444000e+03 5444.000000 5444.000000 5444.000000 5444.000000 5444.000000 ... 5444.000000 5444.000000 5444.000000 5444.000000 5444.000000 5444.000000 5444.000000 5444.000000 5444.000000 5444.000000
mean 19.396533 367.731204 2948.070803 1.781704e+05 8.930056e+04 82027.793146 8894.053645 36956.871279 6709.910184 4607.438047 ... 1140.140314 499.866559 605.912946 1346.499592 6864.305910 1032.557170 0.466686 0.101931 411.095649 0.525757
std 8.557839 167.451590 1709.091099 1.739284e+05 8.844371e+04 78541.294238 9686.785642 31786.638222 7248.253735 4964.223688 ... 1259.746154 589.028709 697.970053 1116.374731 9194.612951 1278.405733 0.251253 0.147964 613.340890 0.274559
min 1.000000 1.000000 3.000000 3.221229e+01 1.870391e+01 13.508380 0.000000 6.234637 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 10.000000 231.000000 1429.750000 6.000286e+04 2.981878e+04 28914.750000 2867.255248 14795.653926 2037.750000 1538.221895 ... 379.878028 119.000000 154.711955 613.968092 1927.626776 276.832862 0.279317 0.005546 51.000000 0.315324
50% 21.000000 380.000000 2940.500000 1.316506e+05 6.572600e+04 60971.320505 6102.355860 28478.677296 4664.592475 3142.372016 ... 768.572493 300.415780 376.551707 1103.861693 3934.841645 632.705066 0.448054 0.052353 198.109337 0.522764
75% 28.000000 533.000000 4373.250000 2.353213e+05 1.159880e+05 107664.463758 11091.076314 49461.337567 8560.722341 5688.250000 ... 1401.113626 677.745768 809.000000 1806.286511 7907.854403 1345.000000 0.642838 0.135347 542.915193 0.744552
max 35.000000 640.000000 5924.000000 1.927979e+06 1.015549e+06 900343.899912 109745.000000 323759.000000 82890.000000 49923.746582 ... 12921.000000 7104.000000 8364.977439 15806.100780 91699.845468 20077.204134 1.000000 1.000000 12537.922184 1.000000

8 rows × 21 columns

2. Manage Missing Data

First, we need to see which columns contain NaN values.

In [7]:
subset_df.isna().sum()
Out[7]:
pc11_state_id                        0
pc11_district_id                     0
pc11_subdistrict_id                  0
total_population                    36
male_population                     36
female_population                   36
total_no_of_registered_children_    36
total_hhd                           36
total_childs_aged_0_to_3_years      36
total_childs_aged_0_to_3_years_i    36
total_no_of_lactating_mothers       36
total_no_of_lactating_mothers_re    36
total_no_of_beneficiaries_receiv    36
total_no_of_eligible_beneficiari    36
total_shg                           36
total_grd_and_pg_in_village         36
total_trained_trainee_under_skil    36
availability_of_mother_child_hea    36
availability_of_adult_edu_centre    36
total_no_of_elect_rep_oriented_u    36
public_info_board                   36
dtype: int64

The NaN count is 36 (out of 5,480 rows), which is about 0.66% of the total observations in each of the columns/variables. Thus, it is more efficient to delete the NaN rows.

Here, I prefer to delete the NA/NaN values, for the following reasons:

  1. We cannot assume these parameters or replace them with some number (for example, the average): e.g. the number of SHGs (Self Help Groups), the total population, or the availability of facilities.
  • If there is an NA in the SHG count of a certain area, we should not impute it, because a value of zero could genuinely mean that no SHG exists in that area. Furthermore, the NA rows make up only 0.66% of the data, so removing them is insignificant.
  • We should also not impute the population of a subdistrict, because that would require a further method (calculating the area of the district and multiplying by its density), which takes more effort than deleting the rows (again, only 0.66%!).
  • Without real observation, we cannot assume whether facilities exist in a certain area. Imputing could introduce more bias than simply deleting the rows (the price would be higher).
In [8]:
# Drop rows with NA values in the key columns discussed above
columns_to_remove_na = ['total_shg', 'total_population','availability_of_adult_edu_centre','availability_of_mother_child_hea']

# Drop rows with NA values in those columns
subset_df = subset_df.dropna(subset=columns_to_remove_na)

Finally, check whether any NaN values remain.

In [9]:
# Recheck if there is NA value again
column_where_na = []
row_where_na = []

# Select all columns inside the subset_df DataFrame
all_columns = subset_df.columns

# Loop over each column; if it contains NaN, record the column name and the row indices in the lists above
for col in all_columns:
    if subset_df[col].isna().any():
        na_loc = subset_df.index[subset_df[col].isna()].tolist()
        print(f"{col}: NaN at index {na_loc}")
        column_where_na.append(col)
        row_where_na.extend(na_loc)

if not column_where_na:
    print("No NaN values found. Great Job yay!")
else:
    print("There are NaN values. Please check.")
    print("Columns with NaN values:", column_where_na)
    print("Rows with NaN values:", row_where_na)
No NaN values found. Great Job yay!

3. Initial Visualization

AT FIRST, we will focus on the absolute numbers (not per-capita numbers) of the variables to get a global overview.

Histograms will be used to see the profile of each variable, mainly to see how concentrated the data are.

In [10]:
# I defined which column that I will plot in histogram
columns_hist = [
    'total_childs_aged_0_to_3_years_i',
    'total_population',
    'total_no_of_beneficiaries_receiv',
    'total_no_of_lactating_mothers_re',
    'total_shg',
    'total_grd_and_pg_in_village',
    'total_trained_trainee_under_skil',
    'total_no_of_elect_rep_oriented_u',
    'availability_of_adult_edu_centre',
    'availability_of_mother_child_hea'
]

# public_info_board',

# Then, I set the number of columns and rows for the layout
num_cols = 2
num_rows = 5

# Set up subplots for each column
fig, axes = plt.subplots(nrows=num_rows, ncols=num_cols, figsize=(10, 16))
axes = axes.flatten()

# Titles and optional x-limits for each histogram, collected in one place
plot_config = {
    'total_childs_aged_0_to_3_years_i': ('Number of Children Age 0-3 Years Immunized', (0, 30000)),
    'total_population': ('Total Population in each of Subdistricts', None),
    'total_no_of_beneficiaries_receiv': ('Number of PMMVY Beneficiaries', (0, 3000)),
    'total_no_of_lactating_mothers_re': ('Number of Lactating Mothers Receiving ICDS', (0, 10000)),
    'total_shg': ('Number of SHGs', (0, 8000)),
    'total_grd_and_pg_in_village': ('Number of Graduates and Postgrads', (0, 50000)),
    'total_trained_trainee_under_skil': ('Number of Trainees', (0, 10000)),
    'total_no_of_elect_rep_oriented_u': ('Number of Elected Representative for Govt Program', (0, 4000)),
    'availability_of_adult_edu_centre': ('Availability of Adult Edu Center', None),
    'availability_of_mother_child_hea': ('Availability of Mother-Child Health Facil.', None),
    'public_info_board': ('Availability of Public Information Board', None),
}

# Create histograms in each of the columns
for i, column in enumerate(columns_hist):
    axes[i].hist(subset_df[column].dropna(), bins=50, color='blue', alpha=0.7)
    axes[i].set_xlabel('Value')
    axes[i].set_ylabel('Frequency')

    # Look up the title and x-limit; fall back to the column name if unlisted
    title, xlim = plot_config.get(column, (column, None))
    axes[i].set_title(title)
    if xlim is not None:
        axes[i].set_xlim(*xlim)


# Add a single title for all subplots
fig.suptitle('Histogram of Socioeconomic Key Variables in India', fontsize=16)
plt.tight_layout()
plt.show()

Important: graphical excellence will be applied at the end of each Part (not to each individual graph), so that the reader can follow the story we are making here more easily.

As we can see above, especially in the last two histograms, there are some points of attention:

The 'availability' variables, such as the availability of health facilities and of the education centre, are not categorical while they should be. I have checked the raw files from SHRUG; for both district and subdistrict they have the same characteristics. This would lead to problems if we did not understand the idea behind each variable.

Luckily, it is stated on the SHRUG documentation (https://docs.devdatalab.org/SHRUG-Metadata/Mission%20Antyodaya%20Village%20Facilities%20%282020%29/antyodaya-metadata/#:~:text=refer%20to%20this-,note,-for%20more%20details):

"(they) aggregated binary data on availability of facilities (yes/no) using 2011 Census population weights. This gives the following interpretation for any variable: share of population within a location that has access to a facility. Example: Consider public_transport - whether a village has access to public transport. In the subdistrict-level data, the variable will take values between 0 and 1. A value of 0 for a subdistrict would mean that 0% of population within the subdistrict has access to public transport."
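The population-weighted aggregation described in that note can be illustrated with a toy example (hypothetical villages and numbers, not real data):

```python
import numpy as np

# Three hypothetical villages in one subdistrict:
# binary facility availability (1 = yes, 0 = no) and 2011 census population
has_facility = np.array([1, 0, 0])
population = np.array([2000, 3000, 5000])

# Population-weighted share: the fraction of people with access to the facility
share = np.average(has_facility, weights=population)
print(share)  # 2000 / 10000 = 0.2
```

This is why the subdistrict-level 'availability' variables take continuous values between 0 and 1 instead of the expected 1/2 categories.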

The problem, then: the SHRUG records are from 2020, while the aggregation of the binary data uses 2011 census weights. To make the data more accurate, luckily (again), we can augment it from external sources, minding the same context. There is data available from the Indian Government, under the same programme as our focus here: the Mission Antyodaya Survey Data, which you can access from Open Government Data (OGD) of India (https://data.gov.in/catalog/mission-antyodaya-survey-data?page=1). The data year is 2020, the same year as SHRUG's Mission Antyodaya Survey Data (the only year available there).

We can also get the data from the programme's website: https://missionantyodaya.nic.in/ma2020/rawData2020.html. The website also contains documentation for each of the variables.

We can examine the 'strange' data to recheck whether our judgement is correct. The 'strange' variables are:
'availability_of_adult_edu_centre', 'availability_of_mother_child_hea'

Then, looking at the documentation (download by clicking "Download Supportive Documents" from https://missionantyodaya.nic.in/ma2020/rawData2020.html):

  1. Variable 'availability_of_adult_edu_centre': contains 1 as Yes, 2 as No.
  2. Variable 'availability_of_mother_child_hea': contains 1 as Yes, 2 as No.

Hence, our judgement that the data is not categorical, formed by looking at the last two histograms above, turns out to be right.

Then, what next?

The external data will be downloaded and merged into our main dataset based on the subdistrict_id. What matters next is to perform the augmentation. Let's go!

4. Data Augmentation

There will be three steps of data augmentation:

  1. Step 01: Combine all the data downloaded from OGD into a single dataframe, because it is split into separate files per state (there are around 38 listed states).
  2. Step 02: Combine the subset dataframe (from previous steps) with the shape file (GeoPackage). This is needed because, as you can see from the earlier head() output in Part 01, the district and subdistrict columns contain only codes (1 up to some number), not real names. It will be more intuitive for us if there is a column that contains the common, real name of the district and subdistrict.
  3. Step 03: Combine, pivot, and look up (vlookup-style) the dataframes from Steps 01 and 02 above to get the final data.

Step 01: Preparing the Raw Data from Open Government Data (OGD) India

In [11]:
# Location of where I put the data
folder_path = './data/ogd_downloadeddata_allindia'

# Because I found that not all states are available, my plan to use a for loop over 1.xlsx until 38.xlsx was cancelled
# Instead, to make it more flexible, we scan all available files in folder_path and create a list of the file names found
excel_files = [file for file in os.listdir(folder_path) if file.endswith('.xlsx')]

# Initialize an empty DataFrame to store the combined data
combined_df = pd.DataFrame()
In [12]:
# Read all the Excel files and combine them inside combined_df
# Collecting the frames in a list and concatenating once avoids the slow
# repeated-concat pattern. (Yes, this took ~10 mins on my computer... Good luck to us.)
frames = []
for file in excel_files:
    file_path = os.path.join(folder_path, file)
    frames.append(pd.read_excel(file_path))
combined_df = pd.concat(frames, ignore_index=True)

Since we will need the OGD data mainly for the columns 'availability_of_adult_edu_centre', 'availability_of_public_information_board', and 'availability_of_mother_child_health_facilities' (plus the identifier columns), we will slice combined_df.

In [13]:
selected_columns = [
    'state_code',
    'state_name',
    'district_code',
    'district_name',
    'sub_district_code',
    'sub_district_name',
    'total_childs_aged_0_to_3_years_immunized',
    'total_population',
    'availability_of_adult_edu_centre',
    'availability_of_public_information_board',
    'availability_of_mother_child_health_facilities'
]

ogd_data = combined_df[selected_columns].copy()

Then, let us check the unique values of those two important variables:

In [14]:
selected_columns = [
    'availability_of_adult_edu_centre',
    'availability_of_mother_child_health_facilities'
]
# Please note that some columns from the OGD data have different naming compared to SHRUG

for column in selected_columns:
    value_counts = ogd_data[column].value_counts()
    print(f"Value counts for '{column}':\n{value_counts}\n")
Value counts for 'availability_of_adult_edu_centre':
availability_of_adult_edu_centre
2.0      605522
1.0       42806
0.0           7
135.0         1
Name: count, dtype: int64

Value counts for 'availability_of_mother_child_health_facilities':
availability_of_mother_child_health_facilities
2      444216
1      204112
0           5
120         2
20          2
35          2
26          1
350         1
140         1
41          1
56          1
24          1
22          1
10          1
52          1
27          1
15          1
302         1
182         1
18          1
30          1
320         1
110         1
352         1
108         1
Name: count, dtype: int64

We can see that there are still some outliers in these columns. Based on the documentation explained above, the values should only be 1 or 2. However, there are values 0 and 135 in 'availability_of_adult_edu_centre', and even 0, 120, 20, etc. in 'availability_of_mother_child_health_facilities'.

Since their counts are very small (compared to the normal 1 and 2 values), we will delete those outlier/error rows.

In [15]:
ogd_data = ogd_data[(ogd_data['availability_of_adult_edu_centre'] != 0) & (ogd_data['availability_of_adult_edu_centre'] != 135)]
ogd_data = ogd_data[(ogd_data['availability_of_mother_child_health_facilities'] >= 1) & (ogd_data['availability_of_mother_child_health_facilities'] <= 2)]
ogd_data.head()
Out[15]:
state_code state_name district_code district_name sub_district_code sub_district_name total_childs_aged_0_to_3_years_immunized total_population availability_of_adult_edu_centre availability_of_public_information_board availability_of_mother_child_health_facilities
0 19 WEST BENGAL 316 MALDAH 2208 Harischandrapur - I 10 261.0 2.0 1 2
1 19 WEST BENGAL 316 MALDAH 2213 Ratua - II 0 1050.0 2.0 1 2
2 19 WEST BENGAL 312 HOOGHLY 2348 Haripal 51 4176.0 2.0 2 2
3 19 WEST BENGAL 318 MEDINIPUR WEST 2468 Dantan - II 16 269.0 2.0 1 1
4 19 WEST BENGAL 304 24 PARAGANAS SOUTH 2414 Budge Budge - II 268 7800.0 2.0 2 1

NEXT:

As we can see above, each row corresponds to one village-level observation, so a subdistrict can appear in many rows; for example, indices 0 and 1 both belong to the district 'MALDAH' but to different subdistricts. That is too granular for our research here. We only need district- and subdistrict-level data (remember our decision in Parts 00 and 01), not village-level data.

Thus, we need to pivot: essentially choosing one representative value per subdistrict for 'availability_of_adult_edu_centre' and 'availability_of_mother_child_health_facilities'.

Super important notes on the logic behind choosing the representative value:

  1. 'availability_of_adult_edu_centre' is a YES/NO categorical variable. Thus, to pivot each subdistrict, we use the mode ("most frequent value"). For example: subdistrict A, where the availability of an adult education centre in its villages A1, A2, A3 is respectively 1 (yes), 2 (no), 2 (no), has mode 2 (no). Hence, subdistrict A's availability of an adult education centre is 2 (no).
  2. 'availability_of_mother_child_health_facilities' is a 1/2 categorical variable. We use the same logic as in No. 1 above.
  3. We ensure that the selected representative value is the one occurring most frequently within each subdistrict, thus consolidating the categorical variable for that specific area.
  4. The mode is a simple and intuitive measure of central tendency for categorical data: it represents the most frequently occurring value and is easy to understand. Other methods, like taking the mean, do not make sense for categorical variables.
In [16]:
# Group by the relevant columns
grouped_data = ogd_data.groupby(['state_code', 'state_name', 'district_code', 'district_name', 'sub_district_code', 'sub_district_name'])

# Define aggregation functions
agg_functions = {
    'total_childs_aged_0_to_3_years_immunized': 'sum',
    'total_population': 'sum',
    'availability_of_adult_edu_centre': lambda x: x.mode().iloc[0] if not x.mode().empty else None,
    'availability_of_mother_child_health_facilities': lambda x: x.mode().iloc[0] if not x.mode().empty else None
}

# Apply aggregation functions to create the pivoted DataFrame
ogd_data_pivot = grouped_data.agg(agg_functions).reset_index()
In [17]:
ogd_data_pivot.head()
Out[17]:
state_code state_name district_code district_name sub_district_code sub_district_name total_childs_aged_0_to_3_years_immunized total_population availability_of_adult_edu_centre availability_of_mother_child_health_facilities
0 1 JAMMU AND KASHMIR 1 ANANTNAG 53 Pahalgam 2117 61296 2.0 2
1 1 JAMMU AND KASHMIR 1 ANANTNAG 54 Bijbehara 2146 112828 2.0 2
2 1 JAMMU AND KASHMIR 1 ANANTNAG 55 Anantnag 2054 96869 2.0 1
3 1 JAMMU AND KASHMIR 1 ANANTNAG 56 Shangus 2608 96970 2.0 2
4 1 JAMMU AND KASHMIR 1 ANANTNAG 57 Kokernag 3372 142602 2.0 2

Step 02: Combine SHP (GEOPACKAGE) and SHRUG

Why? Because the SHRUG files only contain code numbers for state, district, and subdistrict. We need the real names so that it is easier to validate the data at each step against reality (for example, by googling).

First, SUBDISTRICT SHRUG + GPKG

In [18]:
map_path = './data/shrug-pc11subdist-poly-gpkg/subdistrict.gpkg'
map = gpd.read_file(map_path)
gdf = map

The column 'pc11_subdistrict_id' of the GDF is stored as object, while in subset_df (from Parts 01 and 02) it is stored as int. We will cast both to float, in case we need to merge other data in the future. Float is more 'flexible' than integer (for example, it can hold decimals).

In [19]:
gdf.pc11_subdistrict_id = gdf.pc11_subdistrict_id.astype(float)
subset_df.pc11_subdistrict_id = subset_df.pc11_subdistrict_id.astype(float)
In [20]:
# Name the merged variable as geo_pop
# geo_pop means data that contains population and its socioeconomic + geometry!!
geo_pop = gdf.merge(subset_df, on='pc11_subdistrict_id')
geo_pop.head(3)
Out[20]:
pc11_state_id_x pc11_district_id_x pc11_subdistrict_id subdistrict_name geometry pc11_state_id_y pc11_district_id_y total_population male_population female_population ... total_no_of_lactating_mothers_re total_no_of_beneficiaries_receiv total_no_of_eligible_beneficiari total_shg total_grd_and_pg_in_village total_trained_trainee_under_skil availability_of_mother_child_hea availability_of_adult_edu_centre total_no_of_elect_rep_oriented_u public_info_board
0 24 468 3722.0 Lakhpat MULTIPOLYGON (((68.39398 23.44476, 68.39264 23... 24 468 62117.708950 32088.750366 30025.930963 ... 204.869050 350.194879 387.535543 566.165208 394.599993 239.182093 0.661402 0.461577 51.469564 0.926652
1 24 468 3723.0 Rapar MULTIPOLYGON (((70.53440 23.46489, 70.53222 23... 24 468 201902.906291 105105.320376 96796.567185 ... 861.844679 471.671497 645.874145 339.236735 2531.541403 1269.336253 0.873535 0.382873 397.304285 0.897813
2 24 468 3724.0 Bhachau MULTIPOLYGON (((70.45008 23.01226, 70.44904 23... 24 468 176504.000000 92678.000000 83460.000000 ... 790.000000 450.000000 498.000000 234.000000 2577.000000 468.000000 0.623509 0.135301 274.000000 0.933517

3 rows × 25 columns

In [21]:
# Because pc11_state_id_x == pc11_state_id_y and pc11_district_id_x == pc11_district_id_y,
# we will delete the _y columns
# and rename pc11_state_id_x back to pc11_state_id, and pc11_district_id_x to pc11_district_id
geo_pop = geo_pop.drop(['pc11_state_id_y', 'pc11_district_id_y'], axis=1)

# Rename columns
geo_pop = geo_pop.rename(columns={'pc11_state_id_x': 'pc11_state_id', 'pc11_district_id_x': 'pc11_district_id'})

Second, DISTRICT SHRUG + GPKG

Why doing this:

  1. Since geo_pop above only contains the real names of the subdistricts, it is also a good idea to put the district name into the geo_pop data. To get the district names, we need the GPKG file at the district level; we can then look them up (vlookup-style) from "geo_pop_district".
  2. There is an error when using the subdistrict GPKG to create spatial weights (too many islands found). The district GPKG gives a better result --> part of a non-linear process.
In [22]:
mapdist_path = './data/shrug-pc11dist-poly-gpkg/district.gpkg'
map_dist = gpd.read_file(mapdist_path)

Since the GPKG data above is at the district level, while "subset_df" is at the subdistrict level, we need to create a pivot containing the sum of each column (total population, total children, etc.):

  1. Non-categorical variables will be summed up (SUM). For example: district A contains subdistricts A1, A2, A3; thus, the total population of district A = the total population of A1 + A2 + A3.
In [23]:
columns_to_sum = ['total_population', 'male_population', 'female_population', 'total_hhd',
                  'total_grd_and_pg_in_village', 'total_shg',
                  'total_no_of_lactating_mothers', 'total_no_of_lactating_mothers_re',
                  'total_childs_aged_0_to_3_years', 'total_childs_aged_0_to_3_years_i',
                  'total_no_of_beneficiaries_receiv','total_no_of_eligible_beneficiari',
                  'total_trained_trainee_under_skil','total_no_of_elect_rep_oriented_u']

# Create a pivot table
pivot_table_df = subset_df.pivot_table(values=columns_to_sum, index='pc11_district_id', aggfunc='sum')

pivot_table_df.head(3)
Out[23]:
female_population male_population total_childs_aged_0_to_3_years total_childs_aged_0_to_3_years_i total_grd_and_pg_in_village total_hhd total_no_of_beneficiaries_receiv total_no_of_elect_rep_oriented_u total_no_of_eligible_beneficiari total_no_of_lactating_mothers total_no_of_lactating_mothers_re total_population total_shg total_trained_trainee_under_skil
pc11_district_id
1 23523.138329 26510.879626 2761.074569 2590.694210 704.340235 8981.478930 168.859106 115.615244 206.890436 505.056064 479.194760 50041.624221 1.521253 109.530231
2 360076.838747 392652.138478 29465.549515 19313.974815 33116.408459 124419.391360 1013.957246 227.937998 1571.817218 4322.707964 3710.070887 755114.634253 1291.782192 1633.911806
3 45598.474752 47615.448440 2837.299473 2422.499434 3636.436542 16684.780804 242.444825 194.343043 314.018065 791.205308 756.308394 93219.355902 350.046346 191.291275

Merge the two datasets (GPKG and pivot_table) using the same steps as for the subdistrict level above.

In [24]:
# Convert to astype
map_dist.pc11_district_id = map_dist.pc11_district_id.astype(float)
geo_pop.pc11_district_id = geo_pop.pc11_district_id.astype(float)
pivot_table_df.index = pivot_table_df.index.astype(float)

# Next, merge using the indices
geo_pop_district = map_dist.merge(pivot_table_df, left_on='pc11_district_id', right_index=True)
geo_pop_district.head(3)
Out[24]:
pc11_state_id pc11_district_id district_name geometry female_population male_population total_childs_aged_0_to_3_years total_childs_aged_0_to_3_years_i total_grd_and_pg_in_village total_hhd total_no_of_beneficiaries_receiv total_no_of_elect_rep_oriented_u total_no_of_eligible_beneficiari total_no_of_lactating_mothers total_no_of_lactating_mothers_re total_population total_shg total_trained_trainee_under_skil
0 24 468.0 Kachchh MULTIPOLYGON (((70.45008 23.01226, 70.44904 23... 6.918732e+05 7.659721e+05 52789.062307 36124.140136 29509.874064 332446.747616 3933.575088 2623.179898 4593.834795 7194.496554 6452.874545 1.465846e+06 5262.200653 9285.792694
1 24 469.0 Banas Kantha MULTIPOLYGON (((71.24964 24.20926, 71.24207 24... 1.490841e+06 1.603543e+06 137018.185689 93865.516320 103408.588403 607873.050328 9331.840111 8527.665757 11595.960114 19340.773617 16878.197941 3.273774e+06 13546.601715 24814.951196
2 24 470.0 Patan MULTIPOLYGON (((71.42507 23.96967, 71.42497 23... 5.204153e+05 5.651527e+05 49457.172519 30594.736482 64222.702638 258319.762048 1643.088208 3397.482485 2437.191204 9340.013570 7797.987739 1.089937e+06 14351.448633 30910.742259

Third, DISTRICT + SUBDISTRICT

As explained before, we will complete "geo_pop" with the real district name so it is more intuitive.

In [25]:
# Convert the object to astype
geo_pop_district.pc11_district_id = geo_pop_district.pc11_district_id.astype(float)
geo_pop.pc11_district_id = geo_pop.pc11_district_id.astype(float)

# Next, merge using the indices
geo_pop = pd.merge(geo_pop, geo_pop_district[['pc11_district_id', 'district_name']], on='pc11_district_id', how='left')
In [26]:
# Also, do not forget to lowercase the strings
geo_pop['district_name'] = geo_pop['district_name'].str.lower()
geo_pop['subdistrict_name'] = geo_pop['subdistrict_name'].str.lower()

geo_pop.head(3)
# District name is on the very right on the review below.
Out[26]:
pc11_state_id pc11_district_id pc11_subdistrict_id subdistrict_name geometry total_population male_population female_population total_no_of_registered_children_ total_hhd ... total_no_of_beneficiaries_receiv total_no_of_eligible_beneficiari total_shg total_grd_and_pg_in_village total_trained_trainee_under_skil availability_of_mother_child_hea availability_of_adult_edu_centre total_no_of_elect_rep_oriented_u public_info_board district_name
0 24 468.0 3722.0 lakhpat MULTIPOLYGON (((68.39398 23.44476, 68.39264 23... 62117.708950 32088.750366 30025.930963 4008.570778 12319.391613 ... 350.194879 387.535543 566.165208 394.599993 239.182093 0.661402 0.461577 51.469564 0.926652 kachchh
1 24 468.0 3723.0 rapar MULTIPOLYGON (((70.53440 23.46489, 70.53222 23... 201902.906291 105105.320376 96796.567185 9754.329549 42524.801920 ... 471.671497 645.874145 339.236735 2531.541403 1269.336253 0.873535 0.382873 397.304285 0.897813 kachchh
2 24 468.0 3724.0 bhachau MULTIPOLYGON (((70.45008 23.01226, 70.44904 23... 176504.000000 92678.000000 83460.000000 9545.000000 41865.000000 ... 450.000000 498.000000 234.000000 2577.000000 468.000000 0.623509 0.135301 274.000000 0.933517 kachchh

3 rows × 24 columns

Step 03: Joining the Two Dataframes

Next step is to join the data from step 01 (ogd_data_pivot, containing availability data for Adult Education Centres and Health Facilities) and step 02 (geo_pop, containing all the socio-economic data). Both will be joined on "subdistrict_id".

Why do we use "subdistrict_id"?

Well, it was a long and non-linear process.

  1. At first, I wanted to merge the two dataframes based on the real names of "district" + "subdistrict". It turns out that some subdistricts have literally the same name but belong to different districts, which introduces bias when processing the dataframe.
  2. Not all of the SHRUG GPKG is correct (or perhaps I was just not observant enough during data processing). For example, from the GPKG, the Bhiloda subdistrict is in Sabar Kantha, while the OGD file (also confirmed via Wikipedia) places Bhiloda in Arvalli district. Initially, these two issues resulted in 20% missing data (WOW, that is a lot!).
  3. Fortunately, I found out that the subdistrict_id in both dataframes corresponds one-to-one. Voila, we can use this as the basis of the merge (vlookup)!
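Point 1 can be illustrated with a small sketch (hypothetical names and ids): merging on the subdistrict name alone multiplies rows when two districts share a subdistrict name, while merging on the unique id stays one-to-one.

```python
import pandas as pd

# Hypothetical: two districts both contain a subdistrict named "rampur"
left = pd.DataFrame({
    "pc11_subdistrict_id": [101.0, 102.0],
    "subdistrict_name": ["rampur", "rampur"],
    "district_name": ["district_a", "district_b"],
})
right = pd.DataFrame({
    "pc11_subdistrict_id": [101.0, 102.0],
    "subdistrict_name": ["rampur", "rampur"],
    "availability_of_adult_edu_centre": [1.0, 2.0],
})

# Name-based merge: each "rampur" on the left matches both on the right
by_name = left.merge(right, on="subdistrict_name")
# Id-based merge: exactly one match per row
by_id = left.merge(right.drop(columns="subdistrict_name"), on="pc11_subdistrict_id")

print(len(by_name), len(by_id))  # 4 2
```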
In [27]:
# First, we need to rename the old column of 'availability_of_mother_child_hea', 'availability_of_adult_edu_centre' in geo_pop
geo_pop.rename(columns={'availability_of_mother_child_hea' : 'availability_of_mother_child_hea_OLD'}, inplace=True)
geo_pop.rename(columns={'availability_of_adult_edu_centre' : 'availability_of_adult_edu_centre_OLD'}, inplace=True)

# or we can drop instead
#geo_pop.drop(['availability_of_mother_child_hea'], axis=1, inplace=True)
#geo_pop.drop(['availability_of_adult_edu_centre'], axis=1, inplace=True)
In [28]:
# Second, because the subdistrict id column in the ogd data is named 'sub_district_code' while in geo_pop it is named 'pc11_subdistrict_id' (notice the use of "_"),
# we will rename the ogd column to match
ogd_data_pivot.rename(columns={'sub_district_code':'pc11_subdistrict_id'},inplace=True)
In [29]:
# Third, merge the dataframes on 'pc11_subdistrict_id'
dataframe_merged = pd.merge(geo_pop, ogd_data_pivot[['pc11_subdistrict_id','availability_of_mother_child_health_facilities', 'availability_of_adult_edu_centre']],
                    on=['pc11_subdistrict_id'], how='left')
In [30]:
dataframe_merged.head()
Out[30]:
pc11_state_id pc11_district_id pc11_subdistrict_id subdistrict_name geometry total_population male_population female_population total_no_of_registered_children_ total_hhd ... total_shg total_grd_and_pg_in_village total_trained_trainee_under_skil availability_of_mother_child_hea_OLD availability_of_adult_edu_centre_OLD total_no_of_elect_rep_oriented_u public_info_board district_name availability_of_mother_child_health_facilities availability_of_adult_edu_centre
0 24 468.0 3722.0 lakhpat MULTIPOLYGON (((68.39398 23.44476, 68.39264 23... 62117.708950 32088.750366 30025.930963 4008.570778 12319.391613 ... 566.165208 394.599993 239.182093 0.661402 0.461577 51.469564 0.926652 kachchh 1.0 2.0
1 24 468.0 3723.0 rapar MULTIPOLYGON (((70.53440 23.46489, 70.53222 23... 201902.906291 105105.320376 96796.567185 9754.329549 42524.801920 ... 339.236735 2531.541403 1269.336253 0.873535 0.382873 397.304285 0.897813 kachchh 1.0 2.0
2 24 468.0 3724.0 bhachau MULTIPOLYGON (((70.45008 23.01226, 70.44904 23... 176504.000000 92678.000000 83460.000000 9545.000000 41865.000000 ... 234.000000 2577.000000 468.000000 0.623509 0.135301 274.000000 0.933517 kachchh 2.0 2.0
3 24 468.0 3725.0 anjar POLYGON ((70.23631 23.40849, 70.23631 23.40545... 150105.112093 79896.579777 70092.396031 7687.807328 34191.766772 ... 491.505351 3643.775958 1258.834380 0.830914 0.059282 239.531089 1.000000 kachchh 1.0 2.0
4 24 468.0 3726.0 bhuj POLYGON ((69.78433 23.99110, 69.79143 23.98750... 238834.277866 119211.778880 113669.738389 9341.469566 54106.743999 ... 230.560895 2819.277177 1145.297841 0.486578 0.140952 233.778024 0.638809 kachchh 2.0 2.0

5 rows × 26 columns

In [31]:
# Check the result by counting NaN values in 'availability_of_mother_child_health_facilities' and 'availability_of_adult_edu_centre' columns
dataframe_merged[['availability_of_mother_child_health_facilities', 'availability_of_adult_edu_centre']].isna().sum()
Out[31]:
availability_of_mother_child_health_facilities    33
availability_of_adult_edu_centre                  33
dtype: int64

We can delete the NaN rows above, because 33 rows is quite small (33/5454 = 0.6%), which is negligible.
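The negligibility check can be sketched as a one-liner (hypothetical series with 1 missing value out of 5):

```python
import numpy as np
import pandas as pd

# Hypothetical availability column with one missing value out of five rows
avail = pd.Series([1.0, 2.0, np.nan, 1.0, 2.0])

# isna().mean() gives the fraction of missing rows directly
nan_share = avail.isna().mean()
print(f"missing: {nan_share:.1%}")  # missing: 20.0%
```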

In [32]:
dataframe_merged.dropna(subset=['availability_of_mother_child_health_facilities', 'availability_of_adult_edu_centre'], inplace=True)

The dataframe is now ready to be plotted; let us see in the next steps!!

Let us check again how each of the variables is distributed.

In [33]:
# I defined which column that I will plot in histogram
columns_hist = [
    'total_childs_aged_0_to_3_years_i',
    'total_population',
    'total_no_of_beneficiaries_receiv',
    'total_no_of_lactating_mothers_re',
    'total_shg',
    'total_grd_and_pg_in_village',
    'total_trained_trainee_under_skil',
    'total_no_of_elect_rep_oriented_u',
    'availability_of_mother_child_health_facilities',
    'availability_of_adult_edu_centre'
]

# Then, I set the number of columns and rows for the layout
num_cols = 2
num_rows = 5

# Set up subplots for each column
fig, axes = plt.subplots(nrows=num_rows, ncols=num_cols, figsize=(10, 16))
axes = axes.flatten()

# Create histograms in each of the columns
for i, column in enumerate(columns_hist):
    axes[i].hist(dataframe_merged[column].dropna(), bins=75, color='blue', alpha=0.7)
    axes[i].set_xlabel('Value')
    axes[i].set_ylabel('Frequency')

    if column == 'total_grd_and_pg_in_village':
        axes[i].set_title('Number of Graduates and Postgrads')
        axes[i].set_xlim(0, 50000)  # Set x-limit from 0 to 50000

    if column == 'total_shg':
        axes[i].set_title('Number of SHGs')
        axes[i].set_xlim(0, 8000)  # Set x-limit from 0 to 8000

    if column == 'total_no_of_lactating_mothers_re':
        axes[i].set_title('Number of Lactating Mothers Receiving ICDS')
        axes[i].set_xlim(0, 10000)  # Set x-limit from 0 to 10000

    if column == 'total_childs_aged_0_to_3_years_i':
        axes[i].set_title('Number of Children Age 0-3 Years Immunized')
        axes[i].set_xlim(0, 30000)  # Set x-limit from 0 to 30000   

    if column == 'total_no_of_beneficiaries_receiv':
        axes[i].set_title('Number of PMMVY Beneficiaries')
        axes[i].set_xlim(0, 3000)  # Set x-limit from 0 to 3000

    if column == 'total_trained_trainee_under_skil':
        axes[i].set_title('Number of Trainees')
        axes[i].set_xlim(0, 10000)  # Set x-limit from 0 to 10000 
    
    if column == 'availability_of_adult_edu_centre':
        axes[i].set_title('Availability of Adult Edu Center')
    
    if column == 'total_no_of_elect_rep_oriented_u':
        axes[i].set_title('Number of Elected Representative for Govt Program')
        axes[i].set_xlim(0, 3000)  # Set x-limit from 0 to 3000 

    if column == 'availability_of_mother_child_health_facilities':
        axes[i].set_title('Availability of Mother & Child Health Facil.')

    if column == 'total_population':
        axes[i].set_title('Total Population in each of Sub District')    
        axes[i].set_xlim(0, 1000000)  # Set x-limit from 0 to 1000000


# Add a single title for all subplots
fig.suptitle('Histogram of Socioeconomic Key Variables in India', fontsize=16)
plt.tight_layout()
plt.show()

We can see that the Availability data is now fixed, as defined in the documentation, without modification from older data (for example, the 2011 census, which might be outdated) (the same context the SHRUG documentation describes!). We can then move to the next steps!

Next, do some calculations (adding important new variables). We need to create new variables so the comparison is more inclusive and fair (otherwise, the number of immunized children could be higher in a certain area simply because it has more children!).

Part of Non Linear Process:

There are some interesting findings regarding treating per capita or rate variables:

  1. Based on Reference [1], per capita measures can lead to distorted results (something we might find here). Per capita measures are recommended in the following conditions:
  • When comparing cases, involving measures such as "relative deprivation" of one group of people contrasted with that of other populations.
  • When the relevant independent variables are actually employed by decision makers.
  2. Based on Lecture 13 of the Spatial Data Science class by Mr Trivik, slide 27: "Points usually represent events that affect only part of the population and hence are best considered as rates (see Lecture 6)".

Thus, depending on the purpose, the "rate of immunization" and the "absolute number of immunized children" can be used together or individually. Here, we will see how this choice affects the regression.

Let us continue first to make the "rate" variables.

List of new variables:

  1. Percentage of immunized children = total number of children with immunization (0-3 years) / total number of children age 0-3 yrs.
  2. Ratio of people with higher education = total number of graduates(or postgrads) / total population
  3. Ratio of SHGs (per total population) = total number of SHGs / total population
  4. Percentage of lactating mother that receive ICDS = Number of lactating mothers receiving services under ICDS / Number of lactating mothers
  5. Percentage of beneficiaries receiving Pradhan Mantri Matru Vandana Yojana (PMMVY) = Number of beneficiaries receiving benefits under Pradhan Mantri Matru Vandana Yojana / Number of eligible beneficiaries under Pradhan Mantri Matru Vandana Yojana
  6. Ratio of trained people under skill development = Total trained trainee / total population
  7. Ratio of trained representative = Number of elected representatives oriented under Rashtriya Gram Swaraj Abhiyan / total population.

Note:

  1. Ratio = comparing a variable with the total population (per capita). These variables also share the same unit: %.
  2. Percentage = comparing a variable under some condition with the total value of that variable (for example: immunized children divided by the total number of children).

Reference: [1] Uslaner, E. M. (1976). The Pitfalls of Per Capita. American Journal of Political Science, 20(1), 125–133. https://doi.org/10.2307/2110513

The same reasoning also applies to the variables below, because they may later be needed as denominators for per capita calculations (thus, if the value is NaN or zero, division would just result in NaN or 0/0). The analysis is not applicable if there are no children, or even no mothers, in an area.

  • Total population
  • Total number of children age 0-3 years
  • Total number of lactating mothers.
  • Total number of eligible beneficiaries.
In [34]:
# Drop rows where any of these columns is 0; they will serve as denominators
# for the per capita / percentage variables below (avoiding 0/0)
columns_to_remove_zero = ['total_no_of_lactating_mothers',
                          'total_childs_aged_0_to_3_years',
                          'total_population',
                          'total_no_of_eligible_beneficiari']
dataframe_merged = dataframe_merged.loc[(dataframe_merged[columns_to_remove_zero] != 0).all(axis=1)]
In [35]:
# No 01
dataframe_merged['immunized_child_pct'] = (dataframe_merged['total_childs_aged_0_to_3_years_i'] / dataframe_merged['total_childs_aged_0_to_3_years']).round(3) * 100

# No 02
dataframe_merged['grd_rat'] = (dataframe_merged['total_grd_and_pg_in_village'] / dataframe_merged['total_population']).round(3) * 100

# No 03
dataframe_merged['shg_rat'] = (dataframe_merged['total_shg'] / dataframe_merged['total_population']).round(5) * 100

# No 04
dataframe_merged['icds_women_pct'] = (dataframe_merged['total_no_of_lactating_mothers_re'] / dataframe_merged['total_no_of_lactating_mothers']).round(3) * 100

# No 05
dataframe_merged['pmmvy_pct'] = (dataframe_merged['total_no_of_beneficiaries_receiv'] / dataframe_merged['total_no_of_eligible_beneficiari']).round(3) * 100

# No 06
dataframe_merged['trained_ratio'] = (dataframe_merged['total_trained_trainee_under_skil'] / dataframe_merged['total_population']).round(3) * 100

# No 07
dataframe_merged['trained_repr_rat'] = (dataframe_merged['total_no_of_elect_rep_oriented_u'] / dataframe_merged['total_population']).round(3) * 100

Reflection

CDS Framework in Data Transforming

  1. Inclusion: instead of just accepting the variables as-is, note that they combine data from different censuses (our data is from 2020, but the SHRUG data provider states that categorical variables were modified using 2011 census data). Luckily, the original data is available from Open Government Data India, which we can take and combine with the SHRUG files. This makes the source more reliable because it comes from multiple external providers, while staying on the same timeline (2020).
  2. Inequality: the distribution of the data now makes more sense: it shows which areas have health facilities and which do not, making it more readable, understandable, and open to further analysis.
  3. Participation: each modification step is clearly stated here, and you can also download the data yourself from the links provided above. I believe you can follow the story step by step.

Technicality concerns:

  1. Please note that we combine two external sources (OGD + SHRUG), although they share the same recording time and the same context.

Graphical excellence in this Part 03:

  1. Clear and makes good use of space
  2. Clear title and legend
  3. Colours are friendly to colour-blind readers, avoiding harsh polar colours (which hurt the eyes).
  4. Clear purpose and easy to understand.

Part 04: Exploratory Data Analysis: Regression ¶

In this part, there will be two main activities. Part A: Regression Preparation Activities:

  1. Outliers check and elimination
  2. Mapping the Correlation between variables
  3. One Hot Encoding or Dummy Variables making on categorical data
  4. Train - Test Split
  5. VIF feature creation

Part B: Regression Process and Evaluation Activities:

  1. Linear Regression
  2. Linear Regression with dropping features using VIF
  3. Ridge Regression

Need to be evaluated in each of Regression:

  • Fit-ness of the regression
  • Multicollinearity
  • R^2 and adjusted R^2
  • Coefficients and p-values
  • RMSE

Before going further, why do we need to do Regression (using OLS)?

  1. The main purpose of the research: explore the variables affecting the dependent variable. See which independent variables matter most and how the models perform.
  2. Simple and robust initial analysis tool, before going further to a more advanced one such as clustering.
  3. Based on the initial graphs (in this Part 04), most of the features follow a linear trend. OLS is particularly well-suited to linear relationships between the dependent and independent variables; if the data exhibits a linear trend, OLS provides reliable and unbiased coefficient estimates.
  4. OLS provides a solid foundation for statistical inference (that is why we mostly use STATSMODEL), allowing us to test hypotheses about the coefficients, assess the overall fit of the model, and make predictions with confidence intervals.
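As a minimal sketch of the OLS idea on hypothetical data: least squares recovers the linear coefficients. (In the notebook we use statsmodels' `sm.OLS`, which computes the same estimates and additionally reports p-values, confidence intervals, R^2, etc.; the numpy version below keeps the sketch dependency-free.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: y = 3 + 1.5*x1 - 2*x2 + small noise
n = 500
X = rng.normal(size=(n, 2))
y = 3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=n)

# OLS = least squares on a design matrix with an intercept column
design = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)

print(np.round(beta, 2))  # approximately [3.0, 1.5, -2.0]
```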

Part A: Regression Preparation

01 - Outlier Check and Elimination

Let us check first the type of data, to make sure that the desired variables can be processed.

In [36]:
dataframe_merged.dtypes
Out[36]:
pc11_state_id                                       object
pc11_district_id                                   float64
pc11_subdistrict_id                                float64
subdistrict_name                                    object
geometry                                          geometry
total_population                                   float64
male_population                                    float64
female_population                                  float64
total_no_of_registered_children_                   float64
total_hhd                                          float64
total_childs_aged_0_to_3_years                     float64
total_childs_aged_0_to_3_years_i                   float64
total_no_of_lactating_mothers                      float64
total_no_of_lactating_mothers_re                   float64
total_no_of_beneficiaries_receiv                   float64
total_no_of_eligible_beneficiari                   float64
total_shg                                          float64
total_grd_and_pg_in_village                        float64
total_trained_trainee_under_skil                   float64
availability_of_mother_child_hea_OLD               float64
availability_of_adult_edu_centre_OLD               float64
total_no_of_elect_rep_oriented_u                   float64
public_info_board                                  float64
district_name                                       object
availability_of_mother_child_health_facilities     float64
availability_of_adult_edu_centre                   float64
immunized_child_pct                                float64
grd_rat                                            float64
shg_rat                                            float64
icds_women_pct                                     float64
pmmvy_pct                                          float64
trained_ratio                                      float64
trained_repr_rat                                   float64
dtype: object

Next, let us see the scale in each of the variables.

In [127]:
# Specify the columns you want to create boxplots for
columns_name = [
    'total_population',
    'total_childs_aged_0_to_3_years_i',
    'total_no_of_lactating_mothers_re',
    'total_no_of_beneficiaries_receiv',
    'total_grd_and_pg_in_village',
    'total_no_of_elect_rep_oriented_u',
    'total_trained_trainee_under_skil',
    'total_shg',
]

# Specify the names for y-axis labels, so it can easily be read by you 
columns_title = ['Total Population in Subdistricts','Childs (0-3) Immunized', 'Lact. Mothers with ICDS', 'PMMVY Benefic.', 'Hi Edu Level', 'Representatives', 'Under Training', 'SHGs']

# Create subplots
fig, axs = plt.subplots(ncols=len(columns_name), figsize=(20, 5))

# Plot boxplots for each selected column
for i, column in enumerate(columns_name):
    sns.boxplot(y=dataframe_merged[column], ax=axs[i])
    
    # Add line for 3 times the standard deviation
    mean_value = dataframe_merged[column].mean()
    std_dev = dataframe_merged[column].std()
    threshold = mean_value + 3 * std_dev
    axs[i].axhline(y=threshold, color='r', linestyle='--', label='3x SD')
    
    # Set y-axis label
    axs[i].set_ylabel(columns_title[i])

# Add a title to the entire set of subplots
plt.suptitle("Box Plot of Socioeconomics data related to Child Immunization in India", y=1.02, fontsize=16)

# Adjust layout
plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=1.0)
plt.show()

We can see that other than Total Population, the independent variables are around the same scale (scale of thousands).

The red line above shows 3x the standard deviation. A common practice for eliminating outliers after making such a boxplot is to delete the values that fall beyond 3x the standard deviation; assuming a Gaussian profile, 3x stdev covers 99.7% of the data.

Some sources related to this:

  1. https://machinelearningmastery.com/how-to-use-statistics-to-identify-outliers-in-data/
  2. https://towardsdatascience.com/68-95-99-7-the-three-sigma-rule-of-thumb-used-in-power-bi-59cd50b242e2
  3. '68–95–99.7 rule': https://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule
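The 3x-stdev filter described in those sources can be sketched on hypothetical data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical column: 1000 roughly Gaussian values plus two injected extremes
values = pd.Series(np.concatenate([rng.normal(100, 10, 1000), [500.0, 600.0]]))

# Values beyond mean + 3 * std are flagged as outliers
threshold = values.mean() + 3 * values.std()
outliers = values[values > threshold]
filtered = values[values <= threshold]

print(len(outliers), len(filtered))  # 2 1000
```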

Let us see how many outliers there are in each of the columns/variables:

In [128]:
for column in columns_name:
    mean_value = dataframe_merged[column].mean()
    std_dev = dataframe_merged[column].std()
    
    # Calculate the threshold for outliers (beyond 3 times the standard deviation)
    threshold = mean_value + 3 * std_dev
    
    # Identify outliers
    outliers = dataframe_merged[column][dataframe_merged[column] > threshold]
    
    # Calculate the percentage of outliers
    perc = len(outliers) * 100.0 / len(dataframe_merged)
    
    print(f"Column {column} has outliers = {perc:.2f}%")
    print(f"3x Standard Deviation value is: {threshold:.2f}\n")
Column total_population has outliers = 2.20%
3x Standard Deviation value is: 704494.28

Column total_childs_aged_0_to_3_years_i has outliers = 2.52%
3x Standard Deviation value is: 19668.88

Column total_no_of_lactating_mothers_re has outliers = 2.46%
3x Standard Deviation value is: 4963.21

Column total_no_of_beneficiaries_receiv has outliers = 1.93%
3x Standard Deviation value is: 2285.03

Column total_grd_and_pg_in_village has outliers = 2.59%
3x Standard Deviation value is: 34710.17

Column total_no_of_elect_rep_oriented_u has outliers = 1.69%
3x Standard Deviation value is: 2273.61

Column total_trained_trainee_under_skil has outliers = 1.91%
3x Standard Deviation value is: 4905.73

Column total_shg has outliers = 1.41%
3x Standard Deviation value is: 4731.48

As we can see above, the percentage of outliers is fairly large in each column (think about it: if you eliminate outliers column by column, the erased rows accumulate. On the first run, I lost ~10% of the entire dataset).

As a trade-off, we will delete outliers only on the columns related to the control variables (the per capita variables): rows with too large a total population, total number of immunized children, or number of lactating mothers receiving ICDS.

Let us eliminate rows with total_population > 704494.28, then do the same for the other two columns (each with its own 3x st dev threshold).

In [129]:
dfreg_cut = dataframe_merged[dataframe_merged['total_population'] < 704494.28]
#dfreg_cut = dataframe_merged[dataframe_merged['total_population'] > 5000]
In [130]:
dfreg_cut = dfreg_cut[dfreg_cut['total_childs_aged_0_to_3_years_i'] < 19668.88]
dfreg_cut = dfreg_cut[dfreg_cut['total_no_of_lactating_mothers_re'] < 4963.21]
In [131]:
# How much is reduced?
dfreg_cut.shape
Out[131]:
(5137, 33)
In [132]:
# previous data:
dataframe_merged.shape
Out[132]:
(5327, 33)

Compared to the previous data size of 5327 rows, the final data after outlier reduction is 5137 rows (a 3.6% reduction). Let us move on, and come back here if further elimination is needed!

02 - Correlation Mapping

We will use correlation mapping to show whether our chosen broad independent variables are good predictors. Since we cannot do this with the Availability (categorical) variables, we will only examine the numerical variables!

In [133]:
# Declaring which columns that we want to see the correlation
columns_corr = [
    'total_childs_aged_0_to_3_years_i',
    'total_no_of_lactating_mothers_re',
    'total_no_of_beneficiaries_receiv',
    'total_grd_and_pg_in_village',
    'total_trained_trainee_under_skil',
    'total_shg',
    'total_no_of_elect_rep_oriented_u',
]

# Create a subset dataframe with only the specified columns as specified above
subset_corr_df = dfreg_cut[columns_corr]

# Change the names of the variables so they are understandable by a HUMAN!
subset_corr_df.columns = ['Childs (0-3) Immunized','Lact. Mothers with ICDS','PMMVY Benefic.','Hi Edu Level','Under Training','SHGs','No of Represent.']

# Create a pair plot
pair_plot = sns.pairplot(subset_corr_df, kind='scatter', diag_kind='hist')

# Change the colour to black (scatter plot)
for i, j in zip(*np.triu_indices_from(pair_plot.axes, 1)):  # np, not the removed plt.np alias
    sns.scatterplot(x=subset_corr_df.columns[j], y=subset_corr_df.columns[i], data=subset_corr_df, color='black', ax=pair_plot.axes[i, j])

# Add orange linear regression lines with confidence intervals
for i, j in zip(*np.triu_indices_from(pair_plot.axes, 1)):
    sns.regplot(x=subset_corr_df.columns[j], y=subset_corr_df.columns[i], data=subset_corr_df, scatter=False, color='orange', ax=pair_plot.axes[i, j])

# Add title
pair_plot.fig.suptitle('Pair Plot of Key Variables regarding Immunization Level of Childs (0-3) in India', y=1.02, fontsize=16)

# Show the plot
plt.show()

We can see that some of the independent variables move together with the dependent variable: the total (absolute) number of children (0-3 years) with immunization. We can see better using one representative number below.

In [134]:
# Create a heatmap of the correlation matrix
correlation_matrix = subset_corr_df.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=.5)

# Show the plot
plt.title('Heatmap of Socio-economic Variables related to Immunization of Childs (0-3) in India', fontsize=12)
plt.show()

Some important points from the heatmap above:

  1. The number of SHGs, people under training, and chosen representatives have low correlation with the other independent variables (compared to the rest), making them good candidate predictors.
  2. The number of immunized children (0-3) has a high correlation (>= 0.70) with the number of lactating mothers receiving ICDS benefits, and a moderate correlation with the number of PMMVY beneficiaries.
  3. One pair of independent variables is highly correlated with each other: Lactating Mothers with ICDS vs PMMVY Beneficiaries. This could be a sign that one of them is not a good candidate independent variable. We will keep both temporarily and evaluate this effect in the regression.
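The concern in point 3 is what the VIF check (planned in Part A) quantifies: how much a predictor is explained by the other predictors. A hand-rolled sketch on hypothetical data (statsmodels provides `variance_inflation_factor` for the same quantity):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical predictors: x2 is nearly a copy of x1 (like the
# ICDS / PMMVY pair above), x3 is independent of both.
n = 1000
x1 = rng.normal(size=n)
X = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(scale=0.1, size=n),  # strongly collinear with x1
    "x3": rng.normal(size=n),
})

def vif(df: pd.DataFrame, col: str) -> float:
    """VIF_j = 1 / (1 - R^2) from regressing column j on the other columns."""
    y = df[col].to_numpy()
    others = df.drop(columns=col).to_numpy()
    design = np.column_stack([np.ones(len(df)), others])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    r2 = 1 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

for col in X.columns:
    print(col, round(vif(X, col), 1))  # x1 and x2 large (>10), x3 near 1
```
A VIF well above 10 is a common rule-of-thumb signal of problematic multicollinearity.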

Next, let us see the per capita variables that we have derived before!

In [135]:
# Define which variables we want to observe
columns_name = [
    'immunized_child_pct',
    'grd_rat',
    'shg_rat',
    'icds_women_pct',
    'pmmvy_pct',
    'trained_ratio',
    'trained_repr_rat'
]
In [136]:
# Focus on dependent vs all independent variables
subset_corr_df = dfreg_cut[columns_name]
# Human-readable labels, in the same order as columns_name above
subset_corr_df.columns = ['Childs (0-3) Immunized','Hi Edu Level','SHGs','Lact. Mothers with ICDS','PMMVY Benefic.','Under Training','Representatives']

# Create a pair plot with 'Childs (0-3) Immunized' on the y-axis and the others on the x-axis
pair_plot = sns.pairplot(subset_corr_df, y_vars=['Childs (0-3) Immunized'], x_vars=subset_corr_df.columns[1:], kind='scatter', diag_kind='hist')

# Show the plot
pair_plot.fig.suptitle('Pair Plot of Key Variables regarding Immunization Level of Childs (0-3) in India', y=1.12, fontsize=12)
plt.show()

Hmm... yes, it is not a nice-looking graph because the correlations seem unclear! What about the correlation coefficients?

In [137]:
# Create a heatmap of the correlation matrix
correlation_matrix = subset_corr_df.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=.5)

# Show the plot
plt.title('Heatmap of Socio-economic Variables (per capita) related to Immunization of Childs (0-3) in India', fontsize=12)
plt.show()

To check significance, we will also calculate p-values. A low p-value (typically under 5%) indicates a significant correlation between the variables. The null hypothesis of a correlation test typically assumes there is no correlation between the variables.

In [138]:
# Calculate p-values for 'Childs (0-3) Immunized' vs other variables
p_values_list = []

for col in subset_corr_df.columns[1:]:
    _, p_value = pearsonr(subset_corr_df['Childs (0-3) Immunized'], subset_corr_df[col])
    p_values_list.append(('Childs (0-3) Immunized', col, p_value))

# Show it:
for p_value_info in p_values_list:
    print(f"P-value between {p_value_info[0]} and {p_value_info[1]}: {p_value_info[2]:.3f}")
P-value between Childs (0-3) Immunized and Lact. Mothers with ICDS: 0.000
P-value between Childs (0-3) Immunized and PMMVY Benefic.: 0.000
P-value between Childs (0-3) Immunized and Hi Edu Level: 0.000
P-value between Childs (0-3) Immunized and Representatives: 0.000
P-value between Childs (0-3) Immunized and Under Training: 0.036
P-value between Childs (0-3) Immunized and SHGs: 0.000

Reflection

Based on the provided p-values:

Almost all of the chosen independent variables have low p-values (<5%) when tested against the percentage of immunized children.

These small p-values support the idea that there is a statistically significant correlation between the number of immunized children (age 0-3 years) and each of the independent variables (i.e., we reject the H0 of no correlation).

However, some independent variables are also correlated with other independent variables! We will need to evaluate this in the next research step because it is a possible source of bias (for example, a variable treated as independent may turn out to depend on another). For now, let us evaluate these variables using regression methods.

Critical DS Framework Reflection

  1. Inclusivity: inclusivity is taken into account in this analysis by using per capita measures and India-wide data.
  2. Bias in data: the Pearson tests show significance, but the heatmap shows that some independent variables are not purely independent of each other!

03. One Hot Encoding of Categorical Variables

We have two categorical variables: 'availability_of_adult_edu_centre' and 'availability_of_mother_child_health_facilities'.

As written at the documentation:

  1. Variable 'availability_of_adult_edu_centre': contains 1 as Yes, 2 as No.
  2. Variable 'availability_of_mother_child_health_facilities': contains 1 as Yes, 2 as No.

Since both categorical variables are binary, One Hot Encoding is not strictly necessary. However, I hit errors during the regression process because the default dtype of the '1' and '2' values is not int. Thus, we will still apply One Hot Encoding --> convert to TRUE and FALSE --> convert to int --> run the regression.

Further, although One Hot Encoding is mainly useful for categorical variables with more than two classes, I will still include the process here because we might need it in the future (for instance, when a categorical variable takes more than two values: 1, 2, 3, etc.).

If we encode categories as 0, 1, and 2, the model might interpret them as an ordinal scale, assuming that the relationship between 0 and 1 is similar to that between 1 and 2. One-hot encoding eliminates this issue by creating binary vectors without any ordinal implication.
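As a toy sketch of the pipeline just described, here is the one-hot --> boolean --> int chain on a hypothetical `facility` column that uses the same 1 = Yes / 2 = No coding as our survey data (the column name and values are invented for illustration):

```python
import pandas as pd

# Hypothetical toy column with the same 1 = Yes / 2 = No coding as the survey data
toy = pd.DataFrame({'facility': [1.0, 2.0, 1.0]})

# One-hot encode: each category becomes its own boolean column
# (pd.get_dummies names them '<column>_<value>', e.g. 'facility_1.0')
encoded = pd.get_dummies(toy, columns=['facility'])

# Convert booleans to int so statsmodels/sklearn accept them as regressors
encoded = encoded.astype(int)
print(encoded)
```

The printed frame has columns `facility_1.0` and `facility_2.0` holding 0/1 integers, exactly the shape the regression cells below expect.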

In [139]:
categorical_column = 'availability_of_mother_child_health_facilities'

# Performing one-hot encoding
dfreg_encoded = pd.get_dummies(dfreg_cut, columns=[categorical_column])

# If you do not need one hot encoding, just delete below code or add '#':
#dfreg_encoded = dfreg_cut
In [140]:
# Let us see the effect
dfreg_encoded.columns
Out[140]:
Index(['pc11_state_id', 'pc11_district_id', 'pc11_subdistrict_id',
       'subdistrict_name', 'geometry', 'total_population', 'male_population',
       'female_population', 'total_no_of_registered_children_', 'total_hhd',
       'total_childs_aged_0_to_3_years', 'total_childs_aged_0_to_3_years_i',
       'total_no_of_lactating_mothers', 'total_no_of_lactating_mothers_re',
       'total_no_of_beneficiaries_receiv', 'total_no_of_eligible_beneficiari',
       'total_shg', 'total_grd_and_pg_in_village',
       'total_trained_trainee_under_skil',
       'availability_of_mother_child_hea_OLD',
       'availability_of_adult_edu_centre_OLD',
       'total_no_of_elect_rep_oriented_u', 'public_info_board',
       'district_name', 'availability_of_adult_edu_centre',
       'immunized_child_pct', 'grd_rat', 'shg_rat', 'icds_women_pct',
       'pmmvy_pct', 'trained_ratio', 'trained_repr_rat',
       'availability_of_mother_child_health_facilities_1.0',
       'availability_of_mother_child_health_facilities_2.0'],
      dtype='object')
In [141]:
dfreg_encoded.head(3)
Out[141]:
pc11_state_id pc11_district_id pc11_subdistrict_id subdistrict_name geometry total_population male_population female_population total_no_of_registered_children_ total_hhd ... availability_of_adult_edu_centre immunized_child_pct grd_rat shg_rat icds_women_pct pmmvy_pct trained_ratio trained_repr_rat availability_of_mother_child_health_facilities_1.0 availability_of_mother_child_health_facilities_2.0
0 24 468.0 3722.0 lakhpat MULTIPOLYGON (((68.39398 23.44476, 68.39264 23... 62117.708950 32088.750366 30025.930963 4008.570778 12319.391613 ... 2.0 85.0 0.6 0.911 90.2 90.4 0.4 0.1 True False
1 24 468.0 3723.0 rapar MULTIPOLYGON (((70.53440 23.46489, 70.53222 23... 201902.906291 105105.320376 96796.567185 9754.329549 42524.801920 ... 2.0 72.2 1.3 0.168 92.7 73.0 0.6 0.2 True False
2 24 468.0 3724.0 bhachau MULTIPOLYGON (((70.45008 23.01226, 70.44904 23... 176504.000000 92678.000000 83460.000000 9545.000000 41865.000000 ... 2.0 63.4 1.5 0.133 83.0 90.4 0.3 0.2 False True

3 rows × 34 columns

From output above, it is clear that there are two new variables: availability_of_mother_child_health_facilities_1.0 and availability_of_mother_child_health_facilities_2.0.

  1. '1' means that health facilities for children and mothers are available.
  2. '2' means that no such facilities are available.

Next, we will rename the '1.0' to YES and '2.0' to NO for easier reading.

In [142]:
# Rename the columns
dfreg_encoded = dfreg_encoded.rename(columns={'availability_of_mother_child_health_facilities_1.0' : 'avail_of_mc_health_facilities_YES', 'availability_of_mother_child_health_facilities_2.0' : 'avail_of_mc_health_facilities_NO'})

# And convert to INT so regression can do the job
dfreg_encoded.avail_of_mc_health_facilities_YES = dfreg_encoded.avail_of_mc_health_facilities_YES.astype(int)
dfreg_encoded.avail_of_mc_health_facilities_NO = dfreg_encoded.avail_of_mc_health_facilities_NO.astype(int)
In [143]:
# Do the same for availability_of_adult_edu_centre variables
categorical_column = 'availability_of_adult_edu_centre'

# Performing one-hot encoding
dfreg_encoded = pd.get_dummies(dfreg_encoded, columns=[categorical_column])

# Rename the columns
dfreg_encoded = dfreg_encoded.rename(columns={'availability_of_adult_edu_centre_1.0' : 'avail_of_adult_edu_centre_YES', 'availability_of_adult_edu_centre_2.0' : 'avail_of_adult_edu_centre_NO'})

# And convert to INT so regression can do the job
dfreg_encoded.avail_of_adult_edu_centre_YES = dfreg_encoded.avail_of_adult_edu_centre_YES.astype(int)
dfreg_encoded.avail_of_adult_edu_centre_NO = dfreg_encoded.avail_of_adult_edu_centre_NO.astype(int)

Notice that we do not drop the first dummy column, which is usually done when creating dummy variables. Instead of deleting it, we can simply choose which dummy to include in the linear regression; this is for the sake of practice and to keep the original variables.

04. Train Test Split

The data is split into train and test sets at an 80:20 ratio, because we need 20% of the data to measure whether our regression result can be applied to 'new cases'.

We must make sure that the training and test data appropriately represent each state (our data has one observation per subdistrict, so it makes sense to enforce representation at the district or state level); it would be bad for the training data to entirely miss a state! Thus, we will use a stratified train/test split.

Note: in our data, one district has only one subdistrict available, so we cannot stratify at the district level. Hence, we use the state level as the stratify parameter.
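A minimal sketch of why stratification matters, on a hypothetical two-state toy frame (state names and sizes invented for illustration): with a plain random split, a small state can vanish from one partition; with `stratify`, the state ratio is preserved in both.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame: 10 subdistricts in state 'A', 5 in state 'B'
toy = pd.DataFrame({'state': ['A'] * 10 + ['B'] * 5, 'value': range(15)})

# stratify keeps the A:B ratio (2:1) in both partitions, so no state is missed
train_part, test_part = train_test_split(
    toy, test_size=0.2, stratify=toy['state'], random_state=0)

print(train_part['state'].value_counts().to_dict())  # {'A': 8, 'B': 4}
print(test_part['state'].value_counts().to_dict())   # {'A': 2, 'B': 1}
```

The same mechanism, applied to `pc11_state_id` below, guarantees every state appears in both the training and test data.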

In [144]:
# This process uses the sklearn library
train_data, test_data = train_test_split(dfreg_encoded, test_size = 0.2, stratify=dfreg_encoded['pc11_state_id'])
In [145]:
# After processing: 
train_data.shape, test_data.shape
Out[145]:
((4109, 35), (1028, 35))

05. VIF to Detect Multicollinearity

Multicollinearity can cause significant problems during regression model fitting. Here, we will use one method of detecting multicollinearity, the Variance Inflation Factor (VIF). A variance inflation factor is a measure of the amount of multicollinearity in a regression analysis: it identifies the correlation of one independent variable with the group of other independent variables.

VIF tells us the factor by which the correlations amongst the predictors inflate the variance. For example, a VIF of 10 indicates that the existing multicollinearity is inflating the variance of the coefficients 10 times compared to a no multicollinearity model.

VIFs do not have any upper limit. The lower the value the better.

Source: https://www.kaggle.com/code/marcinrutecki/multicollinearity-detection-and-remedies

Based on this interesting source (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6900425/): multicollinearity is present when the VIF is higher than 5 to 10, or when the condition indices are higher than 10 to 30. Thus, there is no single fixed threshold, but a commonly used range is around 5 to 10.

In [146]:
# the independent variables:
X_var_vif = dfreg_encoded[[
    'total_no_of_lactating_mothers_re',
    'total_no_of_beneficiaries_receiv',
    'total_grd_and_pg_in_village',
    'total_trained_trainee_under_skil',
    'total_no_of_elect_rep_oriented_u',
    'total_shg']]
  
# VIF dataframe:
vif2_data = pd.DataFrame()
vif2_data["Feature"] = X_var_vif.columns
  
# calculating VIF for each feature
vif2_data["VIF"] = [variance_inflation_factor(X_var_vif.values, i)
                          for i in range(len(X_var_vif.columns))]
  
vif2_data
Out[146]:
Feature VIF
0 total_no_of_lactating_mothers_re 5.839053
1 total_no_of_beneficiaries_receiv 4.274775
2 total_grd_and_pg_in_village 3.151885
3 total_trained_trainee_under_skil 2.575056
4 total_no_of_elect_rep_oriented_u 2.071100
5 total_shg 2.480037

Note: the categorical variables are excluded here because they result in an error. This happens because VIF is based on the coefficient of determination (R^2) of a regression, and the variance of a categorical variable (to me) makes less sense.

We can see from the output above that two variables have relatively moderate VIFs: total_no_of_lactating_mothers_re (total number of lactating mothers who receive the ICDS subsidy) and total_no_of_beneficiaries_receiv (total number of beneficiaries receiving the PMMVY subsidy). Both are Indian government programs with an impact on health (including the immunization level of children). However, I prefer to drop total_no_of_beneficiaries_receiv because, in the context of immunization level, PMMVY carries more bias than ICDS:

  1. The variable total_no_of_lactating_mothers_re combines two components, ICDS (Integrated Child Development Services) + lactating mothers --> it relates more directly to the 'Immunization' dependent variable.
  2. The variable total_no_of_beneficiaries_receiv --> recipients may have no children.

Thus, let us drop the second variable and recheck the VIF.

In [147]:
# the independent variables:
X_var_vif = dfreg_encoded[[
    'total_no_of_lactating_mothers_re',
    'total_grd_and_pg_in_village',
    'total_trained_trainee_under_skil',
    'total_no_of_elect_rep_oriented_u',
    'total_shg']]
  
# VIF dataframe:
vif2_data = pd.DataFrame()
vif2_data["Feature"] = X_var_vif.columns
  
# calculating VIF for each feature
vif2_data["VIF"] = [variance_inflation_factor(X_var_vif.values, i)
                          for i in range(len(X_var_vif.columns))]
  
vif2_data
Out[147]:
Feature VIF
0 total_no_of_lactating_mothers_re 4.283787
1 total_grd_and_pg_in_village 3.148524
2 total_trained_trainee_under_skil 2.542942
3 total_no_of_elect_rep_oriented_u 1.902099
4 total_shg 2.470081

The VIFs of the other variables do not change significantly. Hence, statistically, the multicollinearity is relatively low even if we do not exclude 'total_no_of_beneficiaries_receiv'.

Part B: Regression Process

The regression process consists of two activities:

  1. Multivariate regression of many independent variables against one dependent variable: to narrow down which variables carry more weight (= more influence).
  2. Linear regression of each narrowed-down independent variable against the dependent variable, one by one: to check whether linear regression is appropriate.

Multivariate Regression

Multivariate regression is chosen because we can see how each variable weighs on, or impacts, the dependent variable.

Let us first look at the performance of multivariate regression on the non-per-capita (raw count) variables.

In [148]:
# Define the dependent variables (output)
y_var = train_data['total_childs_aged_0_to_3_years_i']

# Define the broad independent variables that you want to narrow it down!
x_var = train_data[[
    'total_no_of_lactating_mothers_re',
    'total_no_of_beneficiaries_receiv',
    'total_grd_and_pg_in_village',
    'total_trained_trainee_under_skil',
    'total_no_of_elect_rep_oriented_u',
    'total_shg',
    'avail_of_mc_health_facilities_YES',
    'avail_of_adult_edu_centre_YES']].copy() 

# Add a constant term to the independent variables matrix
x_var = sm.add_constant(x_var)

# Fit the OLS model
model = sm.OLS(y_var, x_var).fit()

# Print the summary of the regression
model.summary()
Out[148]:
OLS Regression Results
Dep. Variable: total_childs_aged_0_to_3_years_i R-squared: 0.764
Model: OLS Adj. R-squared: 0.763
Method: Least Squares F-statistic: 1658.
Date: Fri, 12 Jan 2024 Prob (F-statistic): 0.00
Time: 15:59:53 Log-Likelihood: -36244.
No. Observations: 4109 AIC: 7.251e+04
Df Residuals: 4100 BIC: 7.256e+04
Df Model: 8
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 309.5529 48.995 6.318 0.000 213.496 405.610
total_no_of_lactating_mothers_re 2.9716 0.049 61.063 0.000 2.876 3.067
total_no_of_beneficiaries_receiv 0.2963 0.081 3.672 0.000 0.138 0.454
total_grd_and_pg_in_village 0.0599 0.005 12.090 0.000 0.050 0.070
total_trained_trainee_under_skil -0.0408 0.025 -1.608 0.108 -0.091 0.009
total_no_of_elect_rep_oriented_u -0.0839 0.056 -1.491 0.136 -0.194 0.026
total_shg 0.2446 0.026 9.501 0.000 0.194 0.295
avail_of_mc_health_facilities_YES -38.6552 57.448 -0.673 0.501 -151.284 73.973
avail_of_adult_edu_centre_YES 939.6405 186.221 5.046 0.000 574.547 1304.734
Omnibus: 1262.807 Durbin-Watson: 2.016
Prob(Omnibus): 0.000 Jarque-Bera (JB): 11077.002
Skew: 1.210 Prob(JB): 0.00
Kurtosis: 10.671 Cond. No. 6.86e+04


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.86e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

Well, as we can see in the OLS result above:

  1. R-squared is relatively good at ~0.76, meaning that 76% of the variance of the dependent variable is explained by the regression model.
  2. SAD! The categorical variables (availability of the facilities and of the adult education center) have much larger coefficients than the other variables. This indicates that the variables are not on the same scale (a scaling problem).

In the meantime, it is better to drop one categorical variable. Reason: if you look at the histogram of availability_of_adult_edu_centre, more than 95% of subdistricts have no adult education center. Hence, it is basically uninformative.

That leaves one categorical variable: availability of mother and child facilities. We will analyze how much it contributes to the prediction. Because its value is boolean, we need to scale the other variables. Just for handy practice, we will temporarily use sklearn.

In [149]:
# We will use SKLEARN temporary, because they provide easier use of Scaling (Standard Scaler)
# Define the dependent variable (output)
y_var = train_data['total_childs_aged_0_to_3_years_i']

# Define the independent variables
x_var = train_data[[
    'total_no_of_lactating_mothers_re',
    'total_no_of_beneficiaries_receiv',
    'total_grd_and_pg_in_village',
    'total_trained_trainee_under_skil',
    'total_no_of_elect_rep_oriented_u',
    'total_shg',
    'avail_of_mc_health_facilities_YES'
]].copy()

X_train, X_test, y_train, y_test = train_test_split(x_var, y_var, test_size=0.2, random_state=42)

# Standardize numerical variables
continuous_vars = [
    'total_no_of_lactating_mothers_re',
    'total_no_of_beneficiaries_receiv',
    'total_grd_and_pg_in_village',
    'total_trained_trainee_under_skil',
    'total_no_of_elect_rep_oriented_u',
    'total_shg'
]

scaler = StandardScaler()
X_train[continuous_vars] = scaler.fit_transform(X_train[continuous_vars])
X_test[continuous_vars] = scaler.transform(X_test[continuous_vars])
In [150]:
# Linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r_squared = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r_squared}')

# Print the coefficients and intercept
coefficients = pd.DataFrame({'Variable': X_train.columns, 'Coefficient': model.coef_})
print(coefficients)
print('Intercept:', model.intercept_)
Mean Squared Error: 2604582.619278291
R-squared: 0.7779923417120967
                            Variable  Coefficient
0   total_no_of_lactating_mothers_re  2479.581092
1   total_no_of_beneficiaries_receiv   154.788593
2        total_grd_and_pg_in_village   463.233033
3   total_trained_trainee_under_skil   -68.587941
4   total_no_of_elect_rep_oriented_u   -48.137214
5                          total_shg   219.514006
6  avail_of_mc_health_facilities_YES    -6.462896
Intercept: 3997.539148183692

Looking at the result above, the availability of health facilities has a small coefficient, so we can basically ignore it. First, let us check its effect on the MSE and R^2. After running the program below, the result is: the R^2 before and after dropping the categorical variable changes insignificantly. Thus, again, we can basically ignore this categorical variable and focus on the numerical ones.

In [151]:
X_train = X_train.drop('avail_of_mc_health_facilities_YES', axis=1)
X_test = X_test.drop('avail_of_mc_health_facilities_YES', axis=1)

# Linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r_squared = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r_squared}')

# Print the coefficients and intercept
coefficients = pd.DataFrame({'Variable': X_train.columns, 'Coefficient': model.coef_})
print(coefficients)
print('Intercept:', model.intercept_)
Mean Squared Error: 2604258.90133897
R-squared: 0.7780199345636817
                           Variable  Coefficient
0  total_no_of_lactating_mothers_re  2480.142199
1  total_no_of_beneficiaries_receiv   154.662329
2       total_grd_and_pg_in_village   462.742130
3  total_trained_trainee_under_skil   -68.670872
4  total_no_of_elect_rep_oriented_u   -47.721852
5                         total_shg   219.359304
Intercept: 3995.5178957275484

Does that make our effort to combine Open Government Data (external) of India with our SHRUG data useless, then?

Of course not. Let us just enjoy the process! We would never "feel" the difficulty of doing linear regression with categorical variables if we never tried and failed here, right?

Also, take a break from checking my assignment, Prof! Our journey is only halfway!

Thanks anyway for reading this far; here is some icebreaking, and what I feel when I add new data but then do not need it after some analysis:

Data Trap

Let us continue with the multivariate regression, this time without the categorical variables.

In [152]:
# Define the dependent variables (output)
y_var = train_data['total_childs_aged_0_to_3_years_i']
#y_var /= 1000.0

# Define the broad independent variables that you want to narrow it down!
x_var = train_data[[
    'total_no_of_lactating_mothers_re',
    'total_no_of_beneficiaries_receiv',
    'total_grd_and_pg_in_village',
    'total_trained_trainee_under_skil',
    'total_no_of_elect_rep_oriented_u',
    'total_shg']].copy() 

# Add a constant term to the independent variables matrix
#x_var /= 1000.0
x_var = sm.add_constant(x_var)

# Fit the OLS model
model = sm.OLS(y_var, x_var).fit()

# Print the summary of the regression
model.summary()
Out[152]:
OLS Regression Results
Dep. Variable: total_childs_aged_0_to_3_years_i R-squared: 0.762
Model: OLS Adj. R-squared: 0.762
Method: Least Squares F-statistic: 2194.
Date: Fri, 12 Jan 2024 Prob (F-statistic): 0.00
Time: 16:01:08 Log-Likelihood: -36257.
No. Observations: 4109 AIC: 7.253e+04
Df Residuals: 4102 BIC: 7.257e+04
Df Model: 6
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 305.3273 45.429 6.721 0.000 216.262 394.393
total_no_of_lactating_mothers_re 2.9507 0.048 61.079 0.000 2.856 3.045
total_no_of_beneficiaries_receiv 0.3486 0.080 4.344 0.000 0.191 0.506
total_grd_and_pg_in_village 0.0613 0.005 12.448 0.000 0.052 0.071
total_trained_trainee_under_skil -0.0349 0.025 -1.375 0.169 -0.085 0.015
total_no_of_elect_rep_oriented_u -0.0973 0.056 -1.735 0.083 -0.207 0.013
total_shg 0.2439 0.026 9.464 0.000 0.193 0.294
Omnibus: 1255.672 Durbin-Watson: 2.019
Prob(Omnibus): 0.000 Jarque-Bera (JB): 10904.492
Skew: 1.205 Prob(JB): 0.00
Kurtosis: 10.608 Cond. No. 1.67e+04


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.67e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
In [153]:
# Also calculate the RMSE
model.mse_resid**0.5
Out[153]:
1645.336422386731

Result interpretation:

  1. R-squared is relatively good at 0.75-0.77 (it converges around this range when I run it multiple times, because the training data also varies per run; and that is good!), meaning that ~76% of the variance of the dependent variable is explained by the regression model.
  2. The condition number is high (see the Notes at the bottom of the OLS output, thanks to statsmodels), which is probably caused by multicollinearity.
  3. The p-values of the variables 'total_trained_trainee_under_skil' (total people under development training) and 'total_no_of_elect_rep_oriented_u' (number of elected representatives oriented under Rashtriya Gram Swaraj Abhiyan) are larger than 5%. Therefore, we fail to reject the null hypothesis, meaning these variables are not statistically significant in predicting the immunization level (based on the OLS regression).

Note: a low condition number (close to 1) indicates low multicollinearity: the independent variables are not highly correlated. A high condition number (> 30) suggests moderate to severe multicollinearity; the interpretation of individual coefficients may be less stable, and the model could be sensitive to small changes in the data. A very high condition number (> 100) indicates a serious multicollinearity problem; parameter estimates become highly unstable, and the model may be unreliable.
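A tiny numerical sketch (toy numbers, not our data) of how raw scale alone can inflate the condition number of a design matrix, independent of any collinearity between predictors:

```python
import numpy as np

# Toy design matrix: an intercept column next to a predictor in the thousands
x = np.array([1000.0, 2000.0, 3000.0, 4000.0])
X_raw = np.column_stack([np.ones_like(x), x])

# Same data with the predictor expressed 'in thousands'
X_scaled = np.column_stack([np.ones_like(x), x / 1000.0])

# The huge scale gap between the columns alone inflates the condition number;
# rescaling brings it back into a harmless range
print(np.linalg.cond(X_raw), np.linalg.cond(X_scaled))
```

The raw matrix has a condition number in the thousands while the rescaled one stays well under 30, even though the two describe exactly the same relationship.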

To reduce the multicollinearity between independent variables, we will try to use the VIF (Variance inflation factors) that we have done before!

Recap the VIF result, variables that makes all VIF < 5:

  • 'total_no_of_lactating_mothers_re',
  • 'total_grd_and_pg_in_village',
  • 'total_trained_trainee_under_skil',
  • 'total_no_of_elect_rep_oriented_u',
  • 'total_shg'
In [154]:
# Do regression after judgement using VIF 

# Define the dependent variables (output)
y_var = train_data['total_childs_aged_0_to_3_years_i']

# Define the broad independent variables that you want to narrow it down!
x_var = train_data[[
    'total_no_of_lactating_mothers_re',
    'total_grd_and_pg_in_village',
    'total_trained_trainee_under_skil',
    'total_no_of_elect_rep_oriented_u',
    'total_shg']].copy() 

# Add a constant term to the independent variables matrix
x_var = sm.add_constant(x_var)

# Fit the OLS model
model = sm.OLS(y_var, x_var).fit()

# Print the summary of the regression
condition_number = model.condition_number
condition_number
Out[154]:
16622.94106327228
In [155]:
# Also calculate the RMSE
model.mse_resid**0.5
Out[155]:
1648.9163346189816

Comparing the before and after conditions (changing the independent variables using VIF), the RMSE does not change much (before: 1645, after: 1648)!

Both models still have a high condition number, indicating that the problem is not collinearity.

After some research, this inspired me (https://stats.stackexchange.com/questions/243000/cause-of-a-high-condition-number-in-a-python-statsmodels-regression). A possible explanation of the high condition number (>30): we are fitting a line to the points and then projecting that line all the way back to the origin (x=0) to find the y-intercept. That y-intercept will be very sensitive to small movements in the data points.

This probably happens because our numbers are in the thousands. To verify this mathematically, we will reduce the scale by dividing all variables by 1000 (thus making the units 'in thousands').

In [66]:
# Dividing by 1000 and seeing what happens
# Define the dependent variables (output)
y_var = train_data['total_childs_aged_0_to_3_years_i'] / 1000.0

# Define the broad independent variables that you want to narrow it down!
x_var = train_data[[
    'total_no_of_lactating_mothers_re',
    'total_no_of_beneficiaries_receiv',
    'total_grd_and_pg_in_village',
    'total_trained_trainee_under_skil',
    'total_no_of_elect_rep_oriented_u',
    'total_shg']].copy() / 1000

# Add a constant term to the independent variables matrix
x_var = sm.add_constant(x_var)

# Fit the OLS model
model = sm.OLS(y_var, x_var).fit()

# Print the summary of the regression
model.summary()
Out[66]:
OLS Regression Results
Dep. Variable: total_childs_aged_0_to_3_years_i R-squared: 0.757
Model: OLS Adj. R-squared: 0.756
Method: Least Squares F-statistic: 2128.
Date: Fri, 12 Jan 2024 Prob (F-statistic): 0.00
Time: 15:36:06 Log-Likelihood: -7938.0
No. Observations: 4109 AIC: 1.589e+04
Df Residuals: 4102 BIC: 1.593e+04
Df Model: 6
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 0.2599 0.047 5.551 0.000 0.168 0.352
total_no_of_lactating_mothers_re 2.9804 0.050 59.275 0.000 2.882 3.079
total_no_of_beneficiaries_receiv 0.3893 0.082 4.728 0.000 0.228 0.551
total_grd_and_pg_in_village 0.0555 0.005 10.914 0.000 0.046 0.065
total_trained_trainee_under_skil -0.0205 0.027 -0.765 0.444 -0.073 0.032
total_no_of_elect_rep_oriented_u -0.1484 0.057 -2.610 0.009 -0.260 -0.037
total_shg 0.2877 0.027 10.685 0.000 0.235 0.340
Omnibus: 1285.769 Durbin-Watson: 2.053
Prob(Omnibus): 0.000 Jarque-Bera (JB): 10386.574
Skew: 1.262 Prob(JB): 0.00
Kurtosis: 10.368 Cond. No. 31.8


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Notice that the R^2 does not change, but the condition number is reduced to close to 30!

As long as we know the cause of this, we can temporarily ignore it (especially since it is not a multicollinearity problem).

Then, we will not drop the variables that we initially thought affected the collinearity. Instead, we will drop the variables 'total_trained_trainee_under_skil' (total people under development training) and 'total_no_of_elect_rep_oriented_u' (number of elected representatives oriented under Rashtriya Gram Swaraj Abhiyan), for two reasons:

  1. Their p-values from the OLS are >5%.
  2. Based on the coefficients, they carry low weight compared to the other variables.

Let us see the effect on the RMSE below.

In [67]:
# Dropping the high-p-value variables and less weight (coefficient)
# Define the dependent variables (output)
y_var = train_data['total_childs_aged_0_to_3_years_i']

# Define the broad independent variables that you want to narrow it down!
x_var = train_data[[
    'total_no_of_lactating_mothers_re',
    'total_no_of_beneficiaries_receiv',
    'total_grd_and_pg_in_village',
    'total_shg']].copy() 

# Add a constant term to the independent variables matrix
x_var = sm.add_constant(x_var)

# Fit the OLS model
model = sm.OLS(y_var, x_var).fit()

# Print the summary of the regression
model.summary()
Out[67]:
OLS Regression Results
Dep. Variable: total_childs_aged_0_to_3_years_i R-squared: 0.756
Model: OLS Adj. R-squared: 0.756
Method: Least Squares F-statistic: 3184.
Date: Fri, 12 Jan 2024 Prob (F-statistic): 0.00
Time: 15:36:06 Log-Likelihood: -36326.
No. Observations: 4109 AIC: 7.266e+04
Df Residuals: 4104 BIC: 7.269e+04
Df Model: 4
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 248.6636 46.683 5.327 0.000 157.140 340.187
total_no_of_lactating_mothers_re 2.9632 0.050 59.317 0.000 2.865 3.061
total_no_of_beneficiaries_receiv 0.3168 0.078 4.041 0.000 0.163 0.470
total_grd_and_pg_in_village 0.0535 0.005 11.160 0.000 0.044 0.063
total_shg 0.2856 0.026 10.781 0.000 0.234 0.338
Omnibus: 1276.635 Durbin-Watson: 2.056
Prob(Omnibus): 0.000 Jarque-Bera (JB): 10319.017
Skew: 1.251 Prob(JB): 0.00
Kurtosis: 10.349 Cond. No. 1.66e+04


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.66e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
In [68]:
# Also calculate the RMSE
model.mse_resid**0.5
Out[68]:
1672.8305560627273

Notice that the RMSE barely increases after eliminating the non-significant variables (from ~1645 to ~1673). This leaves the variables below, ranked by the weight (coefficient) with which they impact the number of immunized children aged 0-3 years.

  1. Total number of lactating mothers receiving the ICDS subsidy: total_no_of_lactating_mothers_re | coef: 2.9632
  2. Total number of beneficiaries under the PMMVY program: total_no_of_beneficiaries_receiv | coef: 0.3168
  3. Total number of community Self Help Groups: total_shg | coef: 0.2856
  4. Total number of graduates and postgraduates in the village: total_grd_and_pg_in_village | coef: 0.0535

What about the per capita variables? Let us do the regression!

In [69]:
# Define the dependent variables (output)
y_var = train_data['immunized_child_pct']

# Define the broad independent variables that you want to narrow it down!
x_var = train_data[[
    'icds_women_pct',
    'pmmvy_pct',
    'grd_rat',
    'trained_ratio',
    'trained_repr_rat',
    'shg_rat']].copy() 

# 'availability_of_mother_child_health_facilities'

# Add a constant term to the independent variables matrix
x_var = sm.add_constant(x_var)

# Fit the OLS model
model = sm.OLS(y_var, x_var).fit()

# Print the summary of the regression
model.summary()
Out[69]:
OLS Regression Results
Dep. Variable: immunized_child_pct R-squared: 0.222
Model: OLS Adj. R-squared: 0.220
Method: Least Squares F-statistic: 194.6
Date: Fri, 12 Jan 2024 Prob (F-statistic): 8.23e-219
Time: 15:36:06 Log-Likelihood: -16931.
No. Observations: 4109 AIC: 3.388e+04
Df Residuals: 4102 BIC: 3.392e+04
Df Model: 6
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 32.7339 1.462 22.389 0.000 29.868 35.600
icds_women_pct 0.4297 0.017 25.758 0.000 0.397 0.462
pmmvy_pct 0.0004 0.013 0.029 0.977 -0.024 0.025
grd_rat 0.6184 0.083 7.432 0.000 0.455 0.782
trained_ratio -1.5371 0.374 -4.113 0.000 -2.270 -0.804
trained_repr_rat -2.5610 0.790 -3.243 0.001 -4.109 -1.013
shg_rat 2.2448 0.286 7.851 0.000 1.684 2.805
Omnibus: 257.458 Durbin-Watson: 1.998
Prob(Omnibus): 0.000 Jarque-Bera (JB): 344.522
Skew: -0.568 Prob(JB): 1.54e-75
Kurtosis: 3.850 Cond. No. 737.


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Notice that the R^2 is very low (0.2) compared to the non per capita variables before (0.76).

Below, we plot the immunization rate against the total number of children in the same age group. Notice that the correlation is unclear:

  1. A high number of children in an area does not indicate greater immunization effort, and
  2. A low number of children in an area does not indicate lower effort (and thus a higher immunization level).

Therefore, I recommend not using the immunization rate as the dependent variable for the independent variables we chose and narrowed down before.

However, we will use the immunization rate as the dependent variable when comparing districts/subdistricts with each other, which we will do in the next part. This aligns with the recommendation of a credible source [1], which states that per capita measures are recommended under the following conditions:

  • Involving comparison among cases, involving such measures as "relative deprivation" of one group of people contrasted to that of other populations.
  • The relevant independent variables are actually employed by decision makers.

It would be interesting to combine this rate (per capita) variable with the principle from [1] and Tobler's First Law of Geography.

Reference: [1] Uslaner, E. M. (1976). The Pitfalls of Per Capita. American Journal of Political Science, 20(1), 125–133. https://doi.org/10.2307/2110513

In [70]:
sns.regplot(x='total_childs_aged_0_to_3_years', y='immunized_child_pct', data=dfreg_encoded,line_kws={'color': 'black'})
plt.title('Regression Plot: Total Children age 0-3 years vs Immunization Level')
plt.xlabel('Total Children Aged 0 to 3 Years')
plt.ylabel('Immunization Level (Percentage)')

plt.show()

Linear Regression

In this sub-part, we will focus on the narrowed-down independent variables against the dependent variable. Those are:

  1. Total number of lactating mothers receiving ICDS subsidy: total_no_of_lactating_mothers_re | coef: 2.9566
  2. Total number of beneficiaries under PMMVY program: total_no_of_beneficiaries_receiv | coef: 0.2652
  3. Total number of community Self Help Groups: total_shg | coef: 0.2253
  4. Total number of graduates and post graduates attained: total_grd_and_pg_in_village | coef: 0.0562

The coefficient shows the weight (impact) we got from the Multivariate Regression.

Looking deeper also means evaluating important metrics:

  1. R^2
  2. RMSE
  3. Overfitting or underfitting (by comparing train vs test RMSE)
  4. Visual inspection of whether linear regression is sufficient.
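The checklist above can be bundled into a small helper so each single-variable model below is evaluated the same way. This is a minimal sketch using the sklearn/NumPy stack already imported in this notebook; the function name `evaluate_fit` is my own, not a library API.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def evaluate_fit(model, x_train, y_train, x_test, y_test):
    """Return test R^2, train/test RMSE, and their percentage difference."""
    r2 = model.score(x_test, y_test)
    rmse_train = np.sqrt(mean_squared_error(y_train, model.predict(x_train)))
    rmse_test = np.sqrt(mean_squared_error(y_test, model.predict(x_test)))
    diff_pct = abs(rmse_test - rmse_train) / rmse_train * 100
    return r2, rmse_train, rmse_test, diff_pct
```

Any fitted estimator with `score` and `predict` (such as `LinearRegression`) can be passed in, which avoids repeating the metric boilerplate in every cell.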

1. Total number of children with immunization vs Total number of lactating mothers receiving ICDS subsidy

Here, we will use sklearn because it is simpler and commonly used to compare performance on train and test data.

In [157]:
# Define the dependent variables (output) here
y_train = train_data['total_childs_aged_0_to_3_years_i']
y_test = test_data['total_childs_aged_0_to_3_years_i']

# Define the independent variable (narrowed down earlier)
x_train = train_data[[
    'total_no_of_lactating_mothers_re']].copy() 
x_test = test_data[[
    'total_no_of_lactating_mothers_re']].copy() 

# Creating and training model
lm = LinearRegression()
lm.fit(x_train, y_train)

# Model making a prediction on test data
y_pred = lm.predict(x_test)

print('R2:', round( lm.score(x_test, y_test),4) )

rmse_train = np.sqrt(metrics.mean_squared_error(y_train, lm.predict(x_train)))
print('RMSE on training set:', round(rmse_train,4) )

rmse_test = np.sqrt(metrics.mean_squared_error(y_test, y_pred))
print('RMSE on test set:', round(rmse_test,4) )

difference_rmse = abs(rmse_test-rmse_train)/rmse_train * 100
print('Difference in %:', round(difference_rmse,3) )
R2: 0.7514
RMSE on training set: 1702.7484
RMSE on test set: 1697.1431
Difference in %: 0.329

When it comes to training models, there are two major problems one can encounter: overfitting and underfitting.

Overfitting: the model performs well on the training set but not on the test set. Underfitting: the model performs well on neither the training set nor the test set.

In this prediction, the difference between the training and test RMSE is quite small (around 0-2%), so the model is a good fit. Linear regression is appropriate here.
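The rule of thumb above (a small train/test RMSE gap means a right fit) can be written as a tiny check; the 2% tolerance is my own illustrative choice, not a standard threshold.

```python
def fit_diagnosis(rmse_train: float, rmse_test: float, tol_pct: float = 2.0) -> str:
    """Label a model's fit from train/test RMSE.

    A gap within tol_pct suggests a right fit; a test RMSE much larger
    than the train RMSE suggests overfitting. (tol_pct is illustrative.)
    """
    gap_pct = abs(rmse_test - rmse_train) / rmse_train * 100
    if gap_pct <= tol_pct:
        return 'right fit'
    return 'overfitting' if rmse_test > rmse_train else 'check underfitting'

# The RMSE values reported above (gap of about 0.33%)
print(fit_diagnosis(1702.7484, 1697.1431))  # → right fit
```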

In [158]:
# Set blind-friendly color palette
sns.set_palette("colorblind")

# Scatter plot for train data
plt.scatter(x_train, y_train, label='Train', alpha=0.7)

# Scatter plot for test data
plt.scatter(x_test, y_test, label='Test', alpha=0.7)

# Regression line
plt.plot(x_test, y_pred, label='Regression', color='black')

# Add labels and title
plt.xlabel('Total Number of Lactating Mothers')
plt.ylabel('Immunized Children Aged 0 to 3 Years')
plt.title('No of Children (0-3 yrs) with Immunization vs Lactating Mothers under ICDS')

# Display legend
plt.legend()

# Display the plot
plt.show()

We can also use a machine learning tool called the Decision Tree Regressor, a regression method in sklearn whose tree depth we can vary.

More depth means a more complex model. As we can see from the graph below, there are signs of underfitting: as the depth increases (meaning the model faces more "unexpected" x values), the train and test errors do not converge.

In [73]:
# Decision Tree Regressor
depths = [1, 2, 3, 4, 5, 6, 7, 8, 9, 11]  # Try different depths
train_rmse = []
test_rmse = []

for depth in depths:
    # Create and train the model
    dt_regressor = DecisionTreeRegressor(max_depth=depth)
    dt_regressor.fit(x_train, y_train)
    
    # Predictions
    y_train_pred = dt_regressor.predict(x_train)
    y_test_pred = dt_regressor.predict(x_test)
    
    # Calculate RMSE (Root Mean Squared Error)
    train_rmse.append(mean_squared_error(y_train, y_train_pred, squared=False))
    test_rmse.append(mean_squared_error(y_test, y_test_pred, squared=False))

# Plotting the results
plt.figure(figsize=(8, 4))
plt.plot(depths, train_rmse, label='Training')
plt.plot(depths, test_rmse, label='Test')
plt.xlabel('Tree Depth')
plt.ylabel('RMSE')
plt.title('Decision Tree Regressor - Model Complexity')
plt.legend()
plt.show()

Now do the same for other variables

2. Total number of children with immunization vs Total number of beneficiaries under PMMVY program

In [74]:
# Define the dependent variables (output) here
y_train = train_data['total_childs_aged_0_to_3_years_i']
y_test = test_data['total_childs_aged_0_to_3_years_i']

# Define the independent variable (narrowed down earlier)
x_train = train_data[[
    'total_no_of_beneficiaries_receiv']].copy() 
x_test = test_data[[
    'total_no_of_beneficiaries_receiv']].copy() 

# Creating and training model
lm = LinearRegression()
lm.fit(x_train, y_train)

# Model making a prediction on test data
y_pred = lm.predict(x_test)

print('R2:', round( lm.score(x_test, y_test),4) )

rmse_train = np.sqrt(metrics.mean_squared_error(y_train, lm.predict(x_train)))
print('RMSE on training set:', round(rmse_train,4) )

rmse_test = np.sqrt(metrics.mean_squared_error(y_test, y_pred))
print('RMSE on test set:', round(rmse_test,4) )

difference_rmse = abs(rmse_test-rmse_train)/rmse_train * 100
print('Difference in %:', round(difference_rmse,3) )
R2: 0.3939
RMSE on training set: 2601.947
RMSE on test set: 2605.9737
Difference in %: 0.155
In [75]:
# Set blind-friendly color palette
sns.set_palette("colorblind")

# Scatter plot for train data
plt.scatter(x_train, y_train, label='Train', alpha=0.7)

# Scatter plot for test data
plt.scatter(x_test, y_test, label='Test', alpha=0.7)

# Regression line
plt.plot(x_test, y_pred, label='Regression', color='black')

# Add labels and title
plt.xlabel('Total Number of PMMVY Beneficiaries')
plt.ylabel('Immunized Children Aged 0 to 3 Years')
plt.title('No of Children (0-3 yrs) with Immunization vs PMMVY Beneficiaries')

# Display legend
plt.legend()

# Display the plot
plt.show()

Compared to the previous variable, this one gives a bigger difference between the train and test RMSE. Let us try polynomial regression instead, starting with degree 2.

In [76]:
# Let us try Polynomial Regression!
degree = 2  # You can adjust the degree of the polynomial
poly_features = PolynomialFeatures(degree=degree, include_bias=False)  # Set include_bias=False to avoid constant term
x_train_poly = poly_features.fit_transform(x_train)
x_test_poly = poly_features.transform(x_test)

feature_names = x_train.columns
poly_feature_names = [f"{feature_names[i]}^{j}" for i in range(len(feature_names)) for j in range(1, degree + 1)]

# Assign polynomial feature names to transformed data
x_train_poly_df = pd.DataFrame(x_train_poly, columns=poly_feature_names)
x_test_poly_df = pd.DataFrame(x_test_poly, columns=poly_feature_names)

# Fit the polynomial regression
lm = LinearRegression()
lm.fit(x_train_poly_df, y_train)

# Generate a DataFrame for x_range_poly
x_range = pd.DataFrame(np.linspace(min(x_test.values), max(x_test.values), 100), columns=['total_no_of_beneficiaries_receiv'])
x_range_poly = poly_features.transform(x_range)
x_range_poly_df = pd.DataFrame(x_range_poly, columns=poly_feature_names)

# Model making a prediction on x_range
y_pred = lm.predict(x_range_poly_df)

# Evaluation metrics
print('R2:', round(lm.score(x_test_poly_df, y_test), 4))
rmse_train = np.sqrt(metrics.mean_squared_error(y_train, lm.predict(x_train_poly_df)))
print('RMSE on training set:', round(rmse_train, 4))
rmse_test = np.sqrt(metrics.mean_squared_error(y_test, lm.predict(x_test_poly_df)))
print('RMSE on test set:', round(rmse_test, 4))
difference_rmse = abs(rmse_test - rmse_train) / rmse_train * 100
print('Difference in %:', round(difference_rmse, 3))

# Plot
plt.scatter(x_test.values, y_test, label='Test Data', alpha=0.7)
plt.scatter(x_train.values, y_train, label='Train Data', alpha=0.7)
plt.plot(x_range['total_no_of_beneficiaries_receiv'], y_pred, color='black', label='Polynomial Regression (Degree {})'.format(degree))
plt.xlabel('Total Number of Beneficiaries Received')
plt.ylabel('Immunized Children Aged 0-3 Years')
plt.title('Polynomial Regression with Degree {}'.format(degree))
plt.legend()
plt.show()
R2: 0.4008
RMSE on training set: 2581.2909
RMSE on test set: 2591.2985
Difference in %: 0.388

We can see that neither the R^2 nor the train vs test RMSE difference changes significantly. We can also see the effect of increasing the polynomial degree below.

In [77]:
degrees = [1, 2, 3, 4, 5, 6]
rmse_train_values = []
rmse_test_values = []
difference_rmse_values = []

for degree in degrees:
    poly_features = PolynomialFeatures(degree=degree, include_bias=False)
    x_train_poly = poly_features.fit_transform(x_train)
    x_test_poly = poly_features.transform(x_test)

    # Polynomial feature names
    feature_names = x_train.columns
    poly_feature_names = [f"{feature_names[i]}^{j}" for i in range(len(feature_names)) for j in range(1, degree + 1)]

    # Assign polynomial feature names to transformed data
    x_train_poly_df = pd.DataFrame(x_train_poly, columns=poly_feature_names)
    x_test_poly_df = pd.DataFrame(x_test_poly, columns=poly_feature_names)

    # Polynomial regression
    lm = LinearRegression()
    lm.fit(x_train_poly_df, y_train)

    # Evaluation metrics
    rmse_train = np.sqrt(metrics.mean_squared_error(y_train, lm.predict(x_train_poly_df)))
    rmse_test = np.sqrt(metrics.mean_squared_error(y_test, lm.predict(x_test_poly_df)))
    difference_rmse = abs(rmse_test - rmse_train) / rmse_train * 100

    # Append RMSE values
    rmse_train_values.append(rmse_train)
    rmse_test_values.append(rmse_test)
    difference_rmse_values.append(difference_rmse)

# Plotting
plt.figure(figsize=(15, 5))

# Plot 1: RMSE Train
plt.subplot(1, 3, 1)
plt.plot(degrees, rmse_train_values, marker='o')
plt.title('RMSE Train vs Polynomial Degree')
plt.xlabel('Polynomial Degree')
plt.ylabel('RMSE Train')
plt.grid(True)

# Plot 2: RMSE Test
plt.subplot(1, 3, 2)
plt.plot(degrees, rmse_test_values, marker='o')
plt.title('RMSE Test vs Polynomial Degree')
plt.xlabel('Polynomial Degree')
plt.ylabel('RMSE Test')
plt.grid(True)

# Plot 3: Difference RMSE
plt.subplot(1, 3, 3)
plt.plot(degrees, difference_rmse_values, marker='o')
plt.title('Difference in RMSE (Test-Train) vs Polynomial Degree')
plt.xlabel('Polynomial Degree')
plt.ylabel('Difference in RMSE (%)')
plt.grid(True)

plt.tight_layout()
plt.show()

For the total number of children with immunization vs the total number of beneficiaries under the PMMVY program, the best fit is linear regression, because even at higher polynomial degrees the RMSE difference remains quite small. I think this is because the data is too spread out; overcoming this properly would require more knowledge and tools.

In [78]:
# Decision Tree Regressor
depths = [1, 2, 3, 4, 5, 6, 7, 8, 9, 11]  # Try different depths
train_rmse = []
test_rmse = []

for depth in depths:
    # Create and train the model
    dt_regressor = DecisionTreeRegressor(max_depth=depth)
    dt_regressor.fit(x_train, y_train)
    
    # Predictions
    y_train_pred = dt_regressor.predict(x_train)
    y_test_pred = dt_regressor.predict(x_test)
    
    # Calculate RMSE (Root Mean Squared Error)
    train_rmse.append(mean_squared_error(y_train, y_train_pred, squared=False))
    test_rmse.append(mean_squared_error(y_test, y_test_pred, squared=False))

# Plotting the results
plt.figure(figsize=(8, 4))
plt.plot(depths, train_rmse, label='Training')
plt.plot(depths, test_rmse, label='Test')
plt.xlabel('Tree Depth')
plt.ylabel('RMSE')
plt.title('Decision Tree Regressor - Model Complexity')
plt.legend()
plt.show()

3. Total number of children with immunization vs Total number of community Self Help Groups

In [79]:
# Define the dependent variables (output) here
y_train = train_data['total_childs_aged_0_to_3_years_i']
y_test = test_data['total_childs_aged_0_to_3_years_i']

# Define the independent variable (narrowed down earlier)
x_train = train_data[[
    'total_shg']].copy() 
x_test = test_data[[
    'total_shg']].copy() 

# Creating and training model
lm = LinearRegression()
lm.fit(x_train, y_train)

# Model making a prediction on test data
y_pred = lm.predict(x_test)

print('R2:', round( lm.score(x_test, y_test),4) )

rmse_train = np.sqrt(metrics.mean_squared_error(y_train, lm.predict(x_train)))
print('RMSE on training set:', round(rmse_train,4) )

rmse_test = np.sqrt(metrics.mean_squared_error(y_test, y_pred))
print('RMSE on test set:', round(rmse_test,4) )

difference_rmse = abs(rmse_test-rmse_train)/rmse_train * 100
print('Difference in %:', round(difference_rmse,3) )
R2: 0.1331
RMSE on training set: 3080.4105
RMSE on test set: 3116.7404
Difference in %: 1.179
In [80]:
# Set blind-friendly color palette
sns.set_palette("colorblind")

# Scatter plot for train data
plt.scatter(x_train, y_train, label='Train', alpha=0.7)

# Scatter plot for test data
plt.scatter(x_test, y_test, label='Test', alpha=0.7)

# Regression line
plt.plot(x_test, y_pred, label='Regression', color='black')

# Add labels and title
plt.xlabel('Total Number of SHGs')
plt.ylabel('Immunized Children Aged 0 to 3 Years')
plt.title('No of Children (0-3 yrs) with Immunization vs No of SHGs')

# Display legend
plt.legend()

# Display the plot
plt.show()

Jumping to the polynomial degree and model performance, a lower polynomial degree is better here. The Decision Tree Regressor below also shows signs of underfitting.

In [82]:
degrees = [1, 2, 3, 4, 5, 6]
rmse_train_values = []
rmse_test_values = []
difference_rmse_values = []

for degree in degrees:
    poly_features = PolynomialFeatures(degree=degree, include_bias=False)
    x_train_poly = poly_features.fit_transform(x_train)
    x_test_poly = poly_features.transform(x_test)

    # Polynomial feature names
    feature_names = x_train.columns
    poly_feature_names = [f"{feature_names[i]}^{j}" for i in range(len(feature_names)) for j in range(1, degree + 1)]

    # Assign polynomial feature names to transformed data
    x_train_poly_df = pd.DataFrame(x_train_poly, columns=poly_feature_names)
    x_test_poly_df = pd.DataFrame(x_test_poly, columns=poly_feature_names)

    # Polynomial regression
    lm = LinearRegression()
    lm.fit(x_train_poly_df, y_train)

    # Evaluation metrics
    rmse_train = np.sqrt(metrics.mean_squared_error(y_train, lm.predict(x_train_poly_df)))
    rmse_test = np.sqrt(metrics.mean_squared_error(y_test, lm.predict(x_test_poly_df)))
    difference_rmse = abs(rmse_test - rmse_train) / rmse_train * 100

    # Append RMSE values
    rmse_train_values.append(rmse_train)
    rmse_test_values.append(rmse_test)
    difference_rmse_values.append(difference_rmse)

# Plotting
plt.figure(figsize=(15, 5))

# Plot 1: RMSE Train
plt.subplot(1, 3, 1)
plt.plot(degrees, rmse_train_values, marker='o')
plt.title('RMSE Train vs Polynomial Degree')
plt.xlabel('Polynomial Degree')
plt.ylabel('RMSE Train')
plt.grid(True)

# Plot 2: RMSE Test
plt.subplot(1, 3, 2)
plt.plot(degrees, rmse_test_values, marker='o')
plt.title('RMSE Test vs Polynomial Degree')
plt.xlabel('Polynomial Degree')
plt.ylabel('RMSE Test')
plt.grid(True)

# Plot 3: Difference RMSE
plt.subplot(1, 3, 3)
plt.plot(degrees, difference_rmse_values, marker='o')
plt.title('Difference in RMSE (Test-Train) vs Polynomial Degree')
plt.xlabel('Polynomial Degree')
plt.ylabel('Difference in RMSE (%)')
plt.grid(True)

plt.tight_layout()
plt.show()

This one is quite complex, because the RMSE difference changes every time I re-run the train/test split. However, since the percentage is quite low (around 2%), we can say that linear regression is still appropriate.

In [83]:
# Decision Tree Regressor
depths = [1, 2, 3, 4, 5, 6, 7, 8, 9, 11]  # Try different depths
train_rmse = []
test_rmse = []

for depth in depths:
    # Create and train the model
    dt_regressor = DecisionTreeRegressor(max_depth=depth)
    dt_regressor.fit(x_train, y_train)
    
    # Predictions
    y_train_pred = dt_regressor.predict(x_train)
    y_test_pred = dt_regressor.predict(x_test)
    
    # Calculate RMSE (Root Mean Squared Error)
    train_rmse.append(mean_squared_error(y_train, y_train_pred, squared=False))
    test_rmse.append(mean_squared_error(y_test, y_test_pred, squared=False))

# Plotting the results
plt.figure(figsize=(8, 4))
plt.plot(depths, train_rmse, label='Training')
plt.plot(depths, test_rmse, label='Test')
plt.xlabel('Tree Depth')
plt.ylabel('RMSE')
plt.title('Decision Tree Regressor - Model Complexity')
plt.legend()
plt.show()

4. Total number of children with immunization vs number of graduates and post graduates

Last but not least,

In [84]:
# Define the dependent variables (output) here
y_train = train_data['total_childs_aged_0_to_3_years_i']
y_test = test_data['total_childs_aged_0_to_3_years_i']

# Define the independent variable (narrowed down earlier)
x_train = train_data[[
    'total_grd_and_pg_in_village']].copy() 
x_test = test_data[[
    'total_grd_and_pg_in_village']].copy() 

# Creating and training model
lm = LinearRegression()
lm.fit(x_train, y_train)

# Model making a prediction on test data
y_pred = lm.predict(x_test)

print('R2:', round( lm.score(x_test, y_test),4) )

rmse_train = np.sqrt(metrics.mean_squared_error(y_train, lm.predict(x_train)))
print('RMSE on training set:', round(rmse_train,4) )

rmse_test = np.sqrt(metrics.mean_squared_error(y_test, y_pred))
print('RMSE on test set:', round(rmse_test,4) )

difference_rmse = abs(rmse_test-rmse_train)/rmse_train * 100
print('Difference in %:', round(difference_rmse,3) )
R2: 0.365
RMSE on training set: 2726.9544
RMSE on test set: 2667.5487
Difference in %: 2.178
In [85]:
# Set blind-friendly color palette
sns.set_palette("colorblind")

# Scatter plot for train data
plt.scatter(x_train, y_train, label='Train', alpha=0.7)

# Scatter plot for test data
plt.scatter(x_test, y_test, label='Test', alpha=0.7)

# Regression line
plt.plot(x_test, y_pred, label='Regression', color='black')

# Add labels and title
plt.xlabel('Total Number of Graduates and Post Graduates')
plt.ylabel('Immunized Children Aged 0 to 3 Years')
plt.title('No of Children (0-3 yrs) with Immunization vs No of Grads')

# Display legend
plt.legend()

# Display the plot
plt.show()
In [86]:
# Let us try Polynomial Regression!
degree = 4  # You can adjust the degree of the polynomial
poly_features = PolynomialFeatures(degree=degree, include_bias=False)  # Set include_bias=False to avoid constant term
x_train_poly = poly_features.fit_transform(x_train)
x_test_poly = poly_features.transform(x_test)

feature_names = x_train.columns
poly_feature_names = [f"{feature_names[i]}^{j}" for i in range(len(feature_names)) for j in range(1, degree + 1)]

# Assign polynomial feature names to transformed data
x_train_poly_df = pd.DataFrame(x_train_poly, columns=poly_feature_names)
x_test_poly_df = pd.DataFrame(x_test_poly, columns=poly_feature_names)

# Fit the polynomial regression
lm = LinearRegression()
lm.fit(x_train_poly_df, y_train)

# Generate a DataFrame for x_range_poly
x_range = pd.DataFrame(np.linspace(min(x_test.values), max(x_test.values), 100), columns=['total_grd_and_pg_in_village'])
x_range_poly = poly_features.transform(x_range)
x_range_poly_df = pd.DataFrame(x_range_poly, columns=poly_feature_names)

# Model making a prediction on x_range
y_pred = lm.predict(x_range_poly_df)

# Evaluation metrics
print('R2:', round(lm.score(x_test_poly_df, y_test), 4))
rmse_train = np.sqrt(metrics.mean_squared_error(y_train, lm.predict(x_train_poly_df)))
print('RMSE on training set:', round(rmse_train, 4))
rmse_test = np.sqrt(metrics.mean_squared_error(y_test, lm.predict(x_test_poly_df)))
print('RMSE on test set:', round(rmse_test, 4))
difference_rmse = abs(rmse_test - rmse_train) / rmse_train * 100
print('Difference in %:', round(difference_rmse, 3))

# Plot
plt.scatter(x_test.values, y_test, label='Test Data', alpha=0.7)
plt.scatter(x_train.values, y_train, label='Train Data', alpha=0.7)
plt.plot(x_range['total_grd_and_pg_in_village'], y_pred, color='black', label='Polynomial Regression (Degree {})'.format(degree))
plt.xlabel('Total Number of Graduates and Post Graduates')
plt.ylabel('Immunized Children Aged 0-3 Years')
plt.title('Polynomial Regression with Degree {}'.format(degree))
plt.legend()
plt.show()
R2: 0.402
RMSE on training set: 2638.7538
RMSE on test set: 2588.5381
Difference in %: 1.903
In [87]:
degrees = [1, 2, 3, 4, 5, 6]
rmse_train_values = []
rmse_test_values = []
difference_rmse_values = []

for degree in degrees:
    poly_features = PolynomialFeatures(degree=degree, include_bias=False)
    x_train_poly = poly_features.fit_transform(x_train)
    x_test_poly = poly_features.transform(x_test)

    # Polynomial feature names
    feature_names = x_train.columns
    poly_feature_names = [f"{feature_names[i]}^{j}" for i in range(len(feature_names)) for j in range(1, degree + 1)]

    # Assign polynomial feature names to transformed data
    x_train_poly_df = pd.DataFrame(x_train_poly, columns=poly_feature_names)
    x_test_poly_df = pd.DataFrame(x_test_poly, columns=poly_feature_names)

    # Polynomial regression
    lm = LinearRegression()
    lm.fit(x_train_poly_df, y_train)

    # Evaluation metrics
    rmse_train = np.sqrt(metrics.mean_squared_error(y_train, lm.predict(x_train_poly_df)))
    rmse_test = np.sqrt(metrics.mean_squared_error(y_test, lm.predict(x_test_poly_df)))
    difference_rmse = abs(rmse_test - rmse_train) / rmse_train * 100

    # Append RMSE values
    rmse_train_values.append(rmse_train)
    rmse_test_values.append(rmse_test)
    difference_rmse_values.append(difference_rmse)

# Plotting
plt.figure(figsize=(15, 5))

# Plot 1: RMSE Train
plt.subplot(1, 3, 1)
plt.plot(degrees, rmse_train_values, marker='o')
plt.title('RMSE Train vs Polynomial Degree')
plt.xlabel('Polynomial Degree')
plt.ylabel('RMSE Train')
plt.grid(True)

# Plot 2: RMSE Test
plt.subplot(1, 3, 2)
plt.plot(degrees, rmse_test_values, marker='o')
plt.title('RMSE Test vs Polynomial Degree')
plt.xlabel('Polynomial Degree')
plt.ylabel('RMSE Test')
plt.grid(True)

# Plot 3: Difference RMSE
plt.subplot(1, 3, 3)
plt.plot(degrees, difference_rmse_values, marker='o')
plt.title('Difference in RMSE (Test-Train) vs Polynomial Degree')
plt.xlabel('Polynomial Degree')
plt.ylabel('Difference in RMSE (%)')
plt.grid(True)

plt.tight_layout()
plt.show()

This relation is also better fit with a lower degree; for example, degree 4 has a low absolute RMSE.

In [88]:
# Decision Tree Regressor
depths = [1, 2, 3, 4, 5, 6, 7, 8, 9, 11]  # Try different depths
train_rmse = []
test_rmse = []

for depth in depths:
    # Create and train the model
    dt_regressor = DecisionTreeRegressor(max_depth=depth)
    dt_regressor.fit(x_train, y_train)
    
    # Predictions
    y_train_pred = dt_regressor.predict(x_train)
    y_test_pred = dt_regressor.predict(x_test)
    
    # Calculate RMSE (Root Mean Squared Error)
    train_rmse.append(mean_squared_error(y_train, y_train_pred, squared=False))
    test_rmse.append(mean_squared_error(y_test, y_test_pred, squared=False))

# Plotting the results
plt.figure(figsize=(8, 4))
plt.plot(depths, train_rmse, label='Training')
plt.plot(depths, test_rmse, label='Test')
plt.xlabel('Tree Depth')
plt.ylabel('RMSE')
plt.title('Decision Tree Regressor - Model Complexity')
plt.legend()
plt.show()

Reflection

Technicality: A. Multivariate Regression

  1. We have derived which variables influence the dependent variable; the variables below are ranked by the weight (coefficient) of their impact on the number of immunized children aged 0-3 years.
  • Total number of lactating mothers receiving ICDS subsidy: total_no_of_lactating_mothers_re | coef: 2.9632
  • Total number of beneficiaries under PMMVY program: total_no_of_beneficiaries_receiv | coef: 0.3168
  • Total number of community Self Help Groups: total_shg | coef: 0.2856
  • Total number of graduates and post graduates attained: total_grd_and_pg_in_village | coef: 0.05

Thus, together with the correlations and p-values: H1, H2, H3, and H5 are supported.

  2. Because the data shows no clear trend when expressed as a rate (per capita), we rely on background theory that per capita measures are best used in the next step, ESDA (comparing between areas).
  3. Multicollinearity is quite low, as shown by the low VIF values.
  4. Possible bias: facilities should be included, since there is strong theoretical background (facilities improve the KPIs). However, because we are limited by the data, excluding the availability of facilities may itself introduce bias. Further discussion with experts is therefore recommended.

B. Linear Regression

  1. Inspecting visually and checking the RMSE between train and test, linear regression can be judged an adequate model that represents the trend.
  2. However, there are signs of underfitting when checking with the Decision Tree Regressor.
  3. These differing results (only a slight difference in RMSE, yet the Decision Tree Regressor showing underfitting) can bias which model we should assign.

How to improve?

  • Use cross-validation techniques, though this requires a "longer story".
  • Look for converging RMSE values across iterations, because every run of the train/test split yields a slightly different RMSE (some 3-5% difference across many iterations). This indicates that we should not rely on a single RMSE value.
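The cross-validation suggestion can be sketched with sklearn's `cross_val_score`; the data below is synthetic, standing in for one feature column and the immunization counts.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for (x, y); in the notebook this would be a feature
# column and the number of immunized children
rng = np.random.default_rng(0)
x = rng.normal(size=(500, 1))
y = 2.5 * x[:, 0] + rng.normal(scale=1.0, size=500)

# 5-fold CV yields one RMSE per fold instead of a single split's RMSE,
# which addresses the run-to-run variability noted above
scores = cross_val_score(LinearRegression(), x, y,
                         scoring='neg_root_mean_squared_error', cv=5)
rmse_per_fold = -scores  # flip sign: sklearn maximizes, so scores are negative
print(rmse_per_fold.mean(), rmse_per_fold.std())
```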

C. Graphical Excellence

  • Clear purpose
  • Good use of space
  • Colors are colorblind-friendly

Part 05: Exploratory Spatial Data Analysis: Spatial Regression ¶

In this part, main activities are:

  1. Visualization
  2. Autocorrelation Preparation: Spatial Weight, ESDA Global, ESDA Local

This part is mainly to test final Hypothesis, the H6.

H6: the level of immunized children (age 0-3 years) is affected by neighbouring areas that also have a certain level of immunization. The immunization practices of one area (subdistrict) may impact the immunization behavior of nearby areas. This stems from the idea that communities with higher immunization rates create a positive norm that influences neighboring communities.

Main idea: to check how a certain area is correlated with its neighbours in the context of "percentage of immunized children (0-3 years)", we will do both global and local autocorrelation, using the Moran plot. We will also examine some regression parameters in this case!

1. Spatial Visualization

Let us see a glimpse of what India looks like!

In [89]:
# Load local Authority geometries using pc11_subdistrict_id code as index
map = gpd.read_file(map_path).set_index('pc11_subdistrict_id', drop=False)
ax = map.plot(figsize=(6, 6), alpha=0.3, color='red');
# Add background map, expressing target CRS so the basemap can be
# reprojected (warped)
ctx.add_basemap(ax, crs=map.crs)
ax.set_title('Map of India')
Out[89]:
Text(0.5, 1.0, 'Map of India')

Next, we will see how the children's immunization rate is distributed across the whole of India. It will be easier to understand with choropleth maps.

The choropleths I will make are based on the "Quantiles" scheme. I chose this because, as you can see in Part 02, most of the data are skewed (positive skew), more or less resembling a Poisson distribution with low lambda. Quantiles are suitable for skewed distributions and help in handling outliers, providing good representation by placing an approximately equal number of observations in each interval.

In [90]:
num_division = 11 # Divide into how many section
geo_pop = dfreg_encoded
geo_pop = geo_pop.rename(columns={'immunized_child_pct': 'percent_immunized_child'})
classi = mapclassify.Quantiles(geo_pop['percent_immunized_child'], k=num_division)
classi
Out[90]:
Quantiles

    Interval       Count
------------------------
[  3.40,  47.09] |   467
( 47.09,  57.00] |   475
( 57.00,  63.50] |   464
( 63.50,  68.30] |   465
( 68.30,  72.60] |   469
( 72.60,  76.30] |   469
( 76.30,  80.10] |   465
( 80.10,  83.70] |   467
( 83.70,  87.60] |   473
( 87.60,  92.40] |   461
( 92.40, 100.00] |   462
In [91]:
# Set up the figure to explain the Quantiles that I made
f, ax = plt.subplots(1, figsize=(8, 5))
# Plot the kernel density estimation (KDE)
sns.kdeplot(geo_pop['percent_immunized_child'], fill=True)
# Add a blue tick for every value at the bottom of the plot (rugs)
sns.rugplot(geo_pop['percent_immunized_child'], alpha=0.5)
# Loop over each break point and plot a vertical red line
for cut in classi.bins:
    plt.axvline(cut, color='red', linewidth=0.75)
# Display image
ax.set_title('KDE for Immunized Children (0-3) Variable')    
plt.show()

Graph Excellence

  1. Wise use of space
  2. Clear purpose and clear label on axis and title
  3. Not distorting the data: minimal modification, showing the distribution naturally.
  4. Encourage the eye using KDE
In [92]:
# Create Choropleths using the 'cividis' colormap
geo_pop.plot(column='percent_immunized_child', scheme='QUANTILES', alpha=1, k=num_division, \
         cmap='cividis',  # Use cividis colormap
         edgecolor='w', linewidth=0.1, figsize=(8, 8))
         
# Add title with proper centering
plt.title('Choropleths of Immunized Children (0-3) in India, per Subdistricts', pad=20, loc='center')

# Add colorbar with proper positioning
cax = plt.axes([0.95, 0.1, 0.03, 0.8])  # [x, y, width, height]
sm = plt.cm.ScalarMappable(cmap='cividis', norm=plt.Normalize(vmin=geo_pop['percent_immunized_child'].min(), vmax=geo_pop['percent_immunized_child'].max()))
sm._A = []  # empty array for the data range
cbar = plt.colorbar(sm, cax=cax)

# Set colorbar label
cbar.set_label('Percentage of children aged 0-3 years immunized')

# Show the plot
plt.show()

Graph Excellence

  1. Decent, colorblind-friendly colours: I use Cividis, a palette safe for protanopia
  2. Clear purpose and clear label on axis and title
  3. Not distorting the data: minimal modification, showing the distribution naturally.
  4. Wise use of space

Next, we will make a Moran plot: a plot of spatial data against its spatially lagged values, augmented by a summary of influence measures for the linear relationship between the data and its lag. Basically, it is linear regression!

The Moran plot is a basic component of both global and local autocorrelation analysis.

Please note that from here we move the analysis "up" to the district level because, as explained before, the subdistrict gpkg produces many undesired "islands" when creating spatial weights. Let us refresh our memory with the district-level dataframe: geo_pop_district.

In [93]:
geo_pop_district.head(3)
Out[93]:
pc11_state_id pc11_district_id district_name geometry female_population male_population total_childs_aged_0_to_3_years total_childs_aged_0_to_3_years_i total_grd_and_pg_in_village total_hhd total_no_of_beneficiaries_receiv total_no_of_elect_rep_oriented_u total_no_of_eligible_beneficiari total_no_of_lactating_mothers total_no_of_lactating_mothers_re total_population total_shg total_trained_trainee_under_skil
0 24 468.0 Kachchh MULTIPOLYGON (((70.45008 23.01226, 70.44904 23... 6.918732e+05 7.659721e+05 52789.062307 36124.140136 29509.874064 332446.747616 3933.575088 2623.179898 4593.834795 7194.496554 6452.874545 1.465846e+06 5262.200653 9285.792694
1 24 469.0 Banas Kantha MULTIPOLYGON (((71.24964 24.20926, 71.24207 24... 1.490841e+06 1.603543e+06 137018.185689 93865.516320 103408.588403 607873.050328 9331.840111 8527.665757 11595.960114 19340.773617 16878.197941 3.273774e+06 13546.601715 24814.951196
2 24 470.0 Patan MULTIPOLYGON (((71.42507 23.96967, 71.42497 23... 5.204153e+05 5.651527e+05 49457.172519 30594.736482 64222.702638 258319.762048 1643.088208 3397.482485 2437.191204 9340.013570 7797.987739 1.089937e+06 14351.448633 30910.742259

Do not forget to calculate the variables needed for the hypotheses!

Here, finally, we can use a per-capita rate of immunization among children, because we will compare between districts in India. Just a reminder: this is based on Uslaner, E. M. (1976). The Pitfalls of Per Capita. American Journal of Political Science, 20(1), 125–133. https://doi.org/10.2307/2110513

In [94]:
geo_pop_district['percent_immunized_child'] = (geo_pop_district['total_childs_aged_0_to_3_years_i'] / geo_pop_district['total_childs_aged_0_to_3_years']).round(3) * 100

Autocorrelation Preparation

a. Creating Spatial Weight

Before doing ESDA at both the global and the local level, we need to create a spatial weights matrix. Here we will use "queen" contiguity, which defines neighbors as units sharing a common boundary or vertex. Queen contiguity is used because it is less sensitive to irregular polygon shapes (common for administrative regions) in the spatial distribution of the units.
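The idea behind queen contiguity can be sketched without any geospatial library. Below is a minimal pure-Python toy on a 3x3 grid of square cells (hypothetical stand-ins for districts), where a cell's neighbours are all cells it touches by edge or corner:

```python
# Toy queen contiguity on a 3x3 grid of square cells (ids 0..8, row-major).
# Queen rule: any cell sharing an edge OR a corner is a neighbour.

def queen_neighbors(rows, cols):
    nbrs = {}
    for r in range(rows):
        for c in range(cols):
            cell = r * cols + c
            nbrs[cell] = [
                nr * cols + nc
                for nr in range(max(0, r - 1), min(rows, r + 2))
                for nc in range(max(0, c - 1), min(cols, c + 2))
                if (nr, nc) != (r, c)
            ]
    return nbrs

w = queen_neighbors(3, 3)
print(len(w[0]))  # corner cell: 3 neighbours
print(len(w[4]))  # centre cell: 8 neighbours
```

Note how the corner cell already picks up its diagonal neighbour; under rook contiguity (shared edges only) it would have just 2 neighbours. This is exactly the distinction `Queen.from_dataframe` applies to real polygon geometries.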

In [95]:
variables = 'pc11_district_id'
weight_df = geo_pop_district

# Check for duplicate entries
if weight_df[variables].duplicated().any():
    # If duplicates exist, drop them or handle them as per your data requirements
    weight_df = weight_df.drop_duplicates(subset=variables)

# Set the index
weight_df = weight_df.set_index(variables, drop=False)

# Create spatial weights matrix
w_queen = Queen.from_dataframe(weight_df, ids=variables)
/Users/mochammadmiftahulfahmi/anaconda3/envs/install_gds_stack/lib/python3.9/site-packages/libpysal/weights/weights.py:224: UserWarning: The weights matrix is not fully connected: 
 There are 26 disconnected components.
 There are 6 islands with ids: 71.0, 88.0, 528.0, 638.0, 639.0, 640.0.
  warnings.warn(message)

As you can see above, the number of islands is quite low, so we can ignore them in further exploration. The total number of districts here is 604, which makes the 6 islands less than 1% of the data (ignoring them will not affect the analysis much).

In [96]:
w_queen.islands
Out[96]:
[71.0, 88.0, 528.0, 638.0, 639.0, 640.0]
In [97]:
# Next, drop the islands (the index is already pc11_district_id, set earlier)
weight_df = weight_df.drop(w_queen.islands)

Once we have the set of districts that are not islands, we need to re-calculate the weights matrix:

In [98]:
w_queen = Queen.from_dataframe(weight_df, ids=variables)
w_queen.transform = 'R'
/Users/mochammadmiftahulfahmi/anaconda3/envs/install_gds_stack/lib/python3.9/site-packages/libpysal/weights/weights.py:224: UserWarning: The weights matrix is not fully connected: 
 There are 20 disconnected components.
  warnings.warn(message)

Note that a warning about disconnected components remains even after dropping the islands; this reflects separate contiguous groups of districts rather than isolated units.

Also check the plot below to see how the numbers of neighbours are distributed. As the graph shows, districts have on average around 4 neighbours.

In [99]:
queen_card = pd.Series(w_queen.cardinalities)
sns.displot(queen_card, bins=10)
plt.title('Distribution of Cardinalities for Queen Contiguity')
Out[99]:
Text(0.5, 1.0, 'Distribution of Cardinalities for Queen Contiguity')

b. Spatial Lag

The next step is to determine the spatial lag.

We will create some new variables:

  1. the spatial lag of the dependent variable: percent_immunized_child
  2. a standardized (z-score) version of variable no. 1 above and of its lag.
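What `lag_spatial` computes can be reproduced by hand: with a row-standardized weights matrix, the spatial lag of a variable is simply the average of its neighbours' values. A minimal NumPy sketch on four hypothetical districts arranged on a line:

```python
import numpy as np

# Binary contiguity for 4 hypothetical districts on a line: 0-1-2-3
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

# Row-standardize (each row sums to 1), as w_queen.transform = 'R' does
W_r = W / W.sum(axis=1, keepdims=True)

y = np.array([60.0, 70.0, 80.0, 90.0])  # e.g. immunization percentages

lag = W_r @ y  # spatial lag = neighbourhood average of y
print(lag)     # [70. 70. 80. 80.]
```

District 1, for instance, gets the mean of its two neighbours (60 and 80), i.e. 70, which is exactly what `weights.lag_spatial(w_queen, ...)` returns for the real data.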
In [100]:
weight_df['w_percent_immunized_child'] = weights.lag_spatial(w_queen, weight_df['percent_immunized_child'])
In [101]:
# Check whether it works or not :D
weight_df[['pc11_district_id', 'district_name','percent_immunized_child', 'w_percent_immunized_child']].head(5)
Out[101]:
pc11_district_id district_name percent_immunized_child w_percent_immunized_child
pc11_district_id
468.0 468.0 Kachchh 68.4 69.75
469.0 469.0 Banas Kantha 68.5 65.15
470.0 470.0 Patan 61.9 68.45
471.0 471.0 Mahesana 82.7 72.00
472.0 472.0 Sabar Kantha 76.1 72.60

Next, create the standardized variables (z-scores)

In [102]:
weight_df['percent_immunized_child_std'] = (weight_df['percent_immunized_child'] - weight_df['percent_immunized_child'].mean()) / weight_df['percent_immunized_child'].std()
In [103]:
weight_df['w_percent_immunized_child_std'] = weights.lag_spatial(w_queen, weight_df['percent_immunized_child_std'])

c. Moran Plot Regression

Finally, we will see how the w_percent_immunized_child_std correlates with percent_immunized_child_std

In [104]:
f, ax = plt.subplots(1, figsize=(6, 6))

# Plot values with regression line and R-squared value
sns.regplot(x='percent_immunized_child_std', y='w_percent_immunized_child_std', data=weight_df, ci=None, ax=ax,
            scatter_kws={'s': 10, 'alpha': 0.7}, line_kws={'color': 'black'})

# Add vertical and horizontal lines
plt.axvline(0, c='k', alpha=0.5)
plt.axhline(0, c='k', alpha=0.5)

# Display R-squared value
corr_value = weight_df['percent_immunized_child_std'].corr(weight_df['w_percent_immunized_child_std'])
r_squared = corr_value**2
plt.text(0.5, 0.1, f'R-squared = {r_squared:.2f}', ha='center', va='center', transform=ax.transAxes, bbox=dict(facecolor='white', edgecolor='white', boxstyle='round,pad=0.5'))

# Display
plt.title('Moran Plot of Standardized Childs Immunization Percentages')
plt.show()

We can see that the R^2 is moderate. It is better than our previous model for the immunization rate, which only reached R^2 = 20%!

Thus, yes, neighbours affect the tendency to have a higher percentage of immunized children.

In the context of the percentage of immunized children above, the Moran Plot can be interpreted along these lines: districts display positive spatial autocorrelation in the way they provide immunization to children. This means that districts with a high percentage of immunization tend to be located near other districts where a high percentage of immunization occurred, and vice versa.

Other than R^2, which tells us how much of the variance the model explains, we also need to evaluate how well the regression fits unseen data.

Thus, as we have done previously in the middle of the story, we will test the fitted model on a test data set.

Here, we use a 70-30 train-test split (about as practical as the usual 80-20).

In [105]:
# Split the data into training and test sets
X = weight_df[['percent_immunized_child_std']]
y = weight_df['w_percent_immunized_child_std']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Do linear regression model on the training data
model = LinearRegression()
model.fit(X_train, y_train)

# Of course we need to next Predict
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# R-2
corr_value = weight_df['percent_immunized_child_std'].corr(weight_df['w_percent_immunized_child_std'])
r_squared = corr_value**2

# RMSE
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))
rmse_test = np.sqrt(mean_squared_error(y_test, y_test_pred))

# Print R-squared value and RMSE
print(f'R-squared = {r_squared:.2f}')
print(f'RMSE (Train) = {rmse_train:.2f}')
print(f'RMSE (Test) = {rmse_test:.2f}')
R-squared = 0.46
RMSE (Train) = 0.58
RMSE (Test) = 0.59

We can see above that the train and test RMSE do not differ greatly, so the regression line generalizes well beyond the sample it was fitted on.

d. Moran I (Global Autocorrelation)

In order to calculate Moran's I in our dataset, we can call a specific function in PySAL directly:

In [106]:
mi = esda.Moran(weight_df['percent_immunized_child'], w_queen)
print("Moran I value = ")
mi.I
Moran I value = 
Out[106]:
0.5382834157404031
In [107]:
moran_scatterplot(mi);
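Under the hood, Moran's I is a normalized cross-product of each value's deviation from the mean with its neighbours' deviations. A minimal NumPy sketch on a toy 4-district line (hypothetical values), implementing the standard formula I = (n / S0) * (z'Wz) / (z'z):

```python
import numpy as np

def morans_i(y, W):
    """Moran's I for values y and a binary contiguity matrix W."""
    z = y - y.mean()          # deviations from the mean
    n = y.size
    s0 = W.sum()              # sum of all weights
    return (n / s0) * (z @ W @ z) / (z @ z)

# 4 districts on a line: 0-1-2-3 (binary, symmetric contiguity)
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

clustered = np.array([1.0, 1.0, 9.0, 9.0])    # similar values are adjacent
alternating = np.array([1.0, 9.0, 1.0, 9.0])  # dissimilar values are adjacent

print(morans_i(clustered, W))    # positive: clustering
print(morans_i(alternating, W))  # negative: dispersion
```

The positive I for the clustered pattern mirrors the 0.54 we obtained for the real district data; `esda.Moran` additionally runs a permutation test to attach a p-value to it.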

e. Local Spatial autocorrelation

Moran's I provides an indication of overall clustering but doesn't pinpoint the specific locations of clusters.

To identify the spatial distribution at a finer level, local measures of spatial autocorrelation are crucial. Unlike global measures that analyze the entire dataset, local measures operate on individual observations, offering detailed insights instead of summarizing the entire map.

In [108]:
# Setup the figure and axis
f, ax = plt.subplots(1, figsize=(6, 6))
# Plot values
sns.regplot(x='percent_immunized_child_std', y='w_percent_immunized_child_std', data=weight_df, ci=None)
# Add vertical and horizontal lines
plt.axvline(0, c='k', alpha=0.5)
plt.axhline(0, c='k', alpha=0.5)
plt.text(1.75, 0.5, "HH", fontsize=25)
plt.text(1.5, -1.5, "HL", fontsize=25)
plt.text(-2, 1, "LH", fontsize=25)
plt.text(-1.5, -2.5, "LL", fontsize=25)
# Display
plt.show()

Next, we identify cases in which the comparison between the value of an observation and the average of its neighbors is either more similar (HH, LL) or more dissimilar (HL, LH) than we would expect from pure chance. In order to do so, we need LISA (Local Indicators of Spatial Association).

In [109]:
lisa = esda.Moran_Local(weight_df['percent_immunized_child'], w_queen)

All we need to pass is the variable of interest, the percentage of immunized children, and the spatial weights that describe the neighborhood relations between the observations that make up the dataset.

Next, to make the map making more straightforward, it is convenient to pull them out and insert them in the main data table (weight_df)

In [110]:
# Break observations into significant or not
weight_df['significant'] = lisa.p_sim < 0.05
# Store the quadrant they belong to
weight_df['quadrant'] = lisa.q

Next, inspect the significance column of the dataframe, followed by the quadrant column.

In [111]:
weight_df['significant'].head(20)
Out[111]:
pc11_district_id
468.0    False
469.0    False
470.0    False
471.0    False
472.0    False
473.0    False
474.0    False
475.0    False
476.0    False
477.0    False
478.0    False
479.0    False
480.0    False
481.0    False
482.0    False
483.0    False
484.0    False
485.0    False
486.0    False
487.0     True
Name: significant, dtype: bool
In [112]:
weight_df['quadrant'].head()
Out[112]:
pc11_district_id
468.0    3
469.0    3
470.0    3
471.0    1
472.0    1
Name: quadrant, dtype: int64

The correspondence between the numbers in the variable and the actual quadrants follows PySAL's convention:

  • 1: HH
  • 2: LH
  • 3: LL
  • 4: HL

HH (High-High): Explanation: Districts in the High-High quadrant have high immunization percentages and are surrounded by other districts with high immunization percentages. Conclusion: This indicates spatial clustering of high immunization rates, suggesting a positive spatial autocorrelation. High-performing districts are geographically close to other high-performing districts.

LL (Low-Low): Explanation: Districts in the Low-Low quadrant have low immunization percentages and are surrounded by other districts with low immunization percentages. Conclusion: This suggests spatial clustering of low immunization rates, indicating a negative spatial autocorrelation. Districts with low immunization rates tend to be close to other districts with similarly low rates.

LH (Low-High): Explanation: Districts in the Low-High quadrant have low immunization percentages but are surrounded by districts with high immunization percentages. Conclusion: This pattern indicates spatial outliers where districts with low immunization rates are in close proximity to districts with high rates. It may be an area of concern or a unique spatial pattern.

HL (High-Low): Explanation: Districts in the High-Low quadrant have high immunization percentages but are surrounded by districts with low immunization percentages. Conclusion: Similar to LH, this pattern represents spatial outliers where districts with high immunization rates are in close proximity to districts with low rates. This, too, could be an area of concern or a unique spatial pattern.
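Once the `significant` and `quadrant` columns are attached, these labels are easy to tabulate. A small pandas sketch (using made-up rows in place of `weight_df`, and PySAL's quadrant coding) that counts the statistically significant cores and outliers:

```python
import pandas as pd

# Hypothetical stand-in for weight_df after attaching the LISA results
df = pd.DataFrame({
    'quadrant':    [1, 1, 3, 3, 2, 4, 1, 3],
    'significant': [True, True, True, False, True, False, False, True],
})

# PySAL's Moran_Local quadrant coding
labels = {1: 'HH', 2: 'LH', 3: 'LL', 4: 'HL'}

# Count only the statistically significant observations per quadrant
counts = (df.loc[df['significant'], 'quadrant']
            .map(labels)
            .value_counts())
print(counts)
```

Run on the real `weight_df`, this would summarize how many districts fall into each significant cluster type before plotting the LISA cluster map.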

Next, with these two columns, significant and quadrant, we can build a typical LISA cluster map:

In [113]:
plot_local_autocorrelation(lisa, weight_df, 'percent_immunized_child');

Reflection

Regarding H6:

  1. Visually, based on the Moran Plot above, the outliers (HL and LH) are relatively few (the clusters are also small).
  2. Checking the significance using Moran's I (see the code below): spatial autocorrelation is statistically significant, meaning that H6 is accepted.
  3. The regression model behind the Moran Plot fits well, looking at R^2 (higher than our previous model of the immunization rate) and at the train vs test RMSE.
In [114]:
if mi.p_sim < 0.05:
    print("Spatial autocorrelation is statistically significant.")
    if mi.I > 0:
        print("Positive spatial autocorrelation (HH or LL).")
    else:
        print("Negative spatial autocorrelation (LH or HL).")
else:
    print("No statistically significant spatial autocorrelation detected.")
Spatial autocorrelation is statistically significant.
Positive spatial autocorrelation (HH or LL).

Principles of graphical excellence:

  1. Clear purpose and clear labels on the axes and title (I do not use the raw variable name from the dataframe, but the true name of the variable).
  2. Select the best graph type (use a histogram to see distribution and tendency)
  3. Clear colour and easy to read

Part 06: Conclusion and Reflection ¶

After this long story, I would like to thank Mr. Trivik and the TAs for answering my questions in lab, and of course for reading my story here.

Conclusion

Several hypotheses in this preliminary research have been tested, but all of them need to be evaluated with expertise such as Trivik's to fully remove possible biases. So far:

  1. H1, H2, H3, and H5 are accepted (statistically supported) because of their significance (via the Pearson correlation test and the p-values of the multivariate regression). Method: plotting from the CSV data from SHRUG.
  2. H4 is rejected because of its low significance, tested by multivariate regression focusing on its R2 and RMSE.
  3. H6 is accepted because it is significant (via Moran's I, and the performance of the regression is relatively high compared to the regression in Part 04, which was against the other independent variables, not spatial ones).
  4. The government subsidy related to the immunization of children aged <3 years is more significant than the other variables. Method: multivariate regression.

Reflection using Critical Data Science Framework

Important notes for users of this preliminary research, in the context of "Communicate Findings: transparency and accessibility of the results":

  1. There are potential biases here, because we assume that the independent variables are represented by data from the SHRUG.
  • Education factor represented by the percentage of graduates and postgraduates in an area
  • Economic condition represented by the percentage of lactating women receiving the ICDS subsidy in an area
  • Economic condition also represented by PMMVY (as I stated at the start, if possible I would include income or poverty data instead).
  • Since the two strongest predictors are government programs related to immunization, they might well be correlated with each other.
  • Social condition represented by the number of SHGs per capita
  • The effect of neighbouring districts is only counted in the variable "children immunization percentage/coverage".
  • Error when creating the immunization-rate variable, because of an unclear trend with other variables. Theoretically, instead of absolute numbers (as we have done here), we should use the coverage of immunized children. However, this can be very complex and needs a further CASE STUDY, or deep observation in India. An example of the complexity: the government can easily push immunization coverage where the number of children is low, but that is not necessarily what happens, since we found that some larger populations also have high coverage.
  2. Potential biases from the REGRESSION methods used here:
  • Although multicollinearity is statistically low (checked using VIF), the effect of the government subsidy is stronger than that of the other variables. This should be confirmed by the authorities and by deeper observation. Thus, a CASE STUDY would be much better to understand this complexity.
  • Comparing RMSE between train and test shows little underfitting. However, a check with a Decision Tree Regressor shows that underfitting does exist. This needs expert judgment, since it is quite a complex problem. It would be much better to compare the model across years, which we could not get from the module we chose: Ministry of Rural Development, Government of India. Mission Antyodaya (2020).
  • Some data is spread out rather than showing a linear trend, so evaluation beyond linear regression is needed.
  • We ran the train-test split only a few times, while we should repeat the split and check the convergence of RMSE, R^2, and the other parameters. Looking at a single value can be misleading when a huge number of iterations could be made.
  3. Inclusivity:
  • The data used for testing H1-H6 covers all of India (less than 5% was cut as outliers), which reflects representativeness.
  • Every piece of code is explained and every graph follows good practices of excellence, hoping to increase "explainability".
  • The data is open source and the code here is accessible, which should increase transparency.
  • Transparency: no data is hidden to "protect" anything; most of the cases are objective. The external sources and theory are also provided here, so you can access and download them. So, this story is very transparent!
  • The train-test split in the main regression uses stratification, so every state in India is represented.
  4. Inequality:
  • The data was used as inclusively as possible, but please consider that this does not necessarily mean the data is just.
  • Possible good use of the findings: the government could push more subsidy (ICDS and PMMVY), because it correlates significantly with the immunization level of children 0-3 years.
  • Possible bad use of the findings: stigmatizing or unfairly labelling specific communities or demographic groups, which can lead to discrimination and social division; or exploiting the findings for marketing (selling vaccines to certain areas), potentially promoting unproven or unsafe immunization-related products or services.
  5. Participation:
  • All graphs follow at least 3 practices of graphical excellence, hoping to be understandable to all stakeholders.
  • All models are evaluated properly using R2, RMSE, and VIF.
  6. Power:
  • Possible bad political use of the findings: politicizing them to gain support or advance political agendas, leading to inappropriate policy decisions that may not prioritize public health.
  7. Positionality:
  • I am a researcher (well, in this case, a preliminary researcher) who wants to unveil the data so that the government can push the immunization of children 0-3 years, because health is a critical factor in human life.
  8. Can the results be enhanced? How? Yes, we can enhance the data, especially for all the factors of H1, H2, H3, and H5:
  • by choosing a directly representative column instead of a proxy;
  • by analyzing geographical features: some districts with few (health-related) amenities should have their nature analyzed (mostly mountainous, hence sparsely populated?) instead of being dropped.
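The concern above about relying on a single train-test split can be addressed by repeating the split many times and inspecting the spread of the metric. A minimal NumPy sketch with synthetic stand-ins for the standardized variables (no sklearn needed; an ordinary least-squares fit plays the role of LinearRegression):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)                       # stand-in for percent_immunized_child_std
y = 0.7 * x + rng.normal(scale=0.6, size=200)  # stand-in for its spatial lag

rmses = []
for seed in range(50):                         # 50 random 70/30 splits
    idx = np.random.default_rng(seed).permutation(x.size)
    test, train = idx[:60], idx[60:]
    # Fit slope + intercept by least squares on the training fold
    A = np.column_stack([x[train], np.ones(train.size)])
    beta, *_ = np.linalg.lstsq(A, y[train], rcond=None)
    pred = beta[0] * x[test] + beta[1]
    rmses.append(np.sqrt(np.mean((y[test] - pred) ** 2)))

rmses = np.array(rmses)
print(rmses.mean(), rmses.std())  # a stable mean with a small spread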
In [ ]: